H.3.8. Natural Language Processing
Milad Allahgholi; Hossein Rahmani; Parinaz Soltanzadeh
Abstract
Stance detection is the process of identifying and classifying an author's point of view or stance towards a specific target in a given text. Most previous studies on stance detection neglect the contextual information hidden in the input data and, as a result, produce less accurate results. In this paper, we propose a novel method called ConSPro, which uses decoder-only transformers to take contextual input data into account during stance detection. First, ConSPro applies zero-shot prompting of decoder-only transformers to extract the context of the target in the input data. Second, in addition to the target and the input text, ConSPro uses the extracted context as a third type of parameter for the ensemble method. We evaluate ConSPro on SemEval2016, and the empirical results indicate that ConSPro outperforms non-contextual approaches by 9% on average with respect to F-measure. The findings of this study show the strong capability of zero-shot prompting to extract informative contextual information with significantly less effort than previous context-extraction methods.
H.3.8. Natural Language Processing
Milad Allahgholi; Hossein Rahmani; Amirhossein Derakhshan; Saman Mohammadi Raouf
Abstract
Document similarity matching is essential for efficient text retrieval, plagiarism detection, and content analysis. Existing studies in this field can be categorized into three approaches: statistical analysis, deep learning, and hybrid approaches. However, to the best of our knowledge, none have incorporated the importance of named entities into their methodologies. In this paper, we propose DOSTE, a method that first extracts named entities and then uses them to enhance document similarity matching through statistical and graph-based analysis. Empirical results indicate that DOSTE achieves better results by emphasizing named entities, with an average improvement of 9% in the average recall metric compared to baseline methods. Also, unlike LLM-based approaches, DOSTE does not require extensive GPU resources. Additionally, non-empirical interpretations of the results indicate that DOSTE is particularly effective at identifying similarity in short documents and in complex document comparisons.
H.3.8. Natural Language Processing
Ali Reza Ghasemi; Javad Salimi Sartakhti
Abstract
This paper evaluates the performance of various fine-tuning methods in Persian natural language processing (NLP) tasks. In low-resource languages like Persian, which lack sufficient rich data for training large models, it is crucial to select fine-tuning techniques that mitigate overfitting and prevent the model from learning weak or surface-level patterns. The main goal of this research is to compare the effectiveness of fine-tuning approaches such as Full-Finetune, LoRA, AdaLoRA, and DoRA on model learning and task performance. We apply these techniques to three different Persian NLP tasks: sentiment analysis, named entity recognition (NER), and span question answering (QA). For this purpose, we conduct experiments on three Transformer-based multilingual models with different architectures and parameter scales: BERT-base multilingual (~168M parameters) with an encoder-only structure, mT5-small (~300M parameters) with an encoder-decoder structure, and mGPT (~1.4B parameters) with a decoder-only structure. Each of these models supports the Persian language but varies in structure and computational requirements, which influences the effectiveness of the different fine-tuning approaches. Results indicate that fully fine-tuned BERT-base multilingual consistently outperforms the other models across all tasks in basic metrics, particularly given the unique challenges of these embedding-based tasks. Additionally, lightweight fine-tuning methods like LoRA and DoRA offer very competitive performance while significantly reducing computational overhead, and they outperform the other models on the Performance-Efficiency Score introduced in the paper. This study contributes to a better understanding of fine-tuning methods, especially for Persian NLP, and offers practical guidance for applying Large Language Models (LLMs) to downstream tasks in low-resource languages.
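To make the efficiency argument concrete, the following minimal sketch counts trainable parameters for a single weight matrix under full fine-tuning and under a LoRA-style low-rank update, in which a frozen d x k matrix is adapted by the product of a d x r matrix and an r x k matrix. The function names, the 768 x 768 layer size, and the rank of 8 are illustrative assumptions, not settings reported in the paper.

```python
def full_finetune_params(d: int, k: int) -> int:
    """Trainable parameters when the whole d x k weight matrix is updated."""
    return d * k


def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters of a rank-r update B @ A with B: d x r and A: r x k."""
    return r * (d + k)


# Hypothetical layer size and rank, chosen only for illustration.
d, k, r = 768, 768, 8
full = full_finetune_params(d, k)
lora = lora_params(d, k, r)
print(f"full: {full}, LoRA: {lora}, ratio: {lora / full:.4f}")
```

For this assumed layer, the low-rank update trains roughly 2% of the parameters that full fine-tuning would, which is the kind of saving that a performance-efficiency trade-off has to weigh against task accuracy.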
H.3.8. Natural Language Processing
Nura Esfandiari; Kourosh Kiani; Razieh Rastgoo
Abstract
Chatbots are computer programs designed to simulate human conversation. Powered by artificial intelligence (AI), chatbots, particularly those built on large language models (LLMs), are increasingly used to provide customer service. A process known as fine-tuning is employed to personalize the answers of LLM-based chatbots. This process demands substantial high-quality data and computational resources. In this article, an innovative hybrid approach is proposed to overcome the computational hurdles associated with fine-tuning LLMs. This approach aims to enhance the answers generated by LLMs, specifically for Persian chatbots used in mobile customer services. A transformer-based evaluation model was developed to score generated answers and select the most appropriate ones. Additionally, a Persian-language dataset tailored to the domain of mobile sales was collected to support the personalization of the Persian chatbot and the training of the evaluation model. This approach is expected to foster increased customer interaction and boost sales within the Persian mobile phone market. Experiments conducted on four different LLMs demonstrated the effectiveness of the proposed approach in generating more relevant and semantically accurate answers for users.
H.3.8. Natural Language Processing
Alireza Mohammadi Gohar; Kambiz Rahbar; Behrouz Minaei-Bidgoli; Ziaeddin Beheshtifard
Abstract
Generative Adversarial Networks (GANs) have emerged as a pivotal research focus within artificial intelligence due to their exceptional capabilities in data generation. Their ability to produce high-quality synthetic data has garnered significant attention, leading to their application in diverse domains such as image and video generation, classification, and style transfer. Beyond these continuous data applications, GANs are also being leveraged for discrete data tasks, including text and music generation. The distinct nature of continuous and discrete data poses unique challenges for GANs. In particular, generating discrete values necessitates the use of Policy Gradient algorithms from reinforcement learning to avoid the direct back-propagation typically used for continuous values. The generator must map latent variables into discrete domains, and unlike continuous value generation, this process involves subtle adjustments to the generator’s outputs to progressively align with real discrete data, guided by the discriminator. This paper aims to provide a thorough review of GAN architectures, fundamental concepts, and applications in the context of discrete data. Additionally, it addresses the existing challenges, evaluation metrics, and future research directions in this burgeoning field.
H.3.8. Natural Language Processing
Davud Mohammadpur; Mehdi Nazari
Abstract
Text summarization has become a favorite subject of researchers due to the rapid growth of content. In title generation, a key aspect of text summarization, creating a concise and meaningful title is essential because it reflects the article's content, objectives, methodologies, and findings. Thus, generating an effective title requires a thorough understanding of the article. Various methods have been proposed in text summarization to generate titles automatically, using machine learning and deep learning techniques to improve results. This study aims to develop a title generation system for scientific articles that uses transformer-based methods to create suitable titles from article abstracts. Pre-trained transformer-based models like BERT, T5, and PEGASUS are optimized for constructing complete sentences, but their ability to generate scientific titles is limited. We address this limitation with a method that combines different models with a suitable training dataset. To create this dataset, we collected abstracts and titles of articles published on the ScienceDirect.com website. After preprocessing this data, we obtained a dataset of 50,000 articles. Evaluation of the proposed method indicates an improvement of more than 20% on various ROUGE metrics in the generation of scientific titles. Additionally, an examination of the results by experts in each scientific field showed that the generated titles are also acceptable to these specialists.
H.3.8. Natural Language Processing
Nura Esfandiari; Kourosh Kiani; Razieh Rastgoo
Abstract
A chatbot is a computer program designed to simulate human-like conversations and interact with users. It is a form of conversational agent that uses Natural Language Processing (NLP) and sequential models to understand user input, interpret the user's intent, and generate an appropriate answer. This approach aims to generate word sequences in the form of coherent phrases. A notable challenge associated with previous models lies in their sequential training process, which can result in less accurate outcomes. To address this limitation, a novel generative chatbot is proposed that integrates Reinforcement Learning (RL) and transformer models. The proposed approach employs a Double Deep Q-Network (DDQN) architecture that uses a transformer model as the agent. This agent takes the human question as the input state and generates the bot answer as the action. To the best of our knowledge, this is the first time that a generative chatbot has been proposed using a DDQN architecture with an embedded transformer as the agent. Results on two public datasets, Daily Dialog and Chit-Chat, validate the superiority of the proposed approach over state-of-the-art models across various evaluation metrics.
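As a rough illustration of the Double DQN idea in general, the sketch below computes the standard DDQN learning target: the online network selects the best next action, and the target network evaluates it. It assumes a small discrete action set with precomputed Q-values; the function name, the example numbers, and the discount factor are hypothetical, and the sketch does not reproduce how the paper maps generated answers to actions.

```python
from typing import Sequence


def ddqn_target(
    reward: float,
    next_q_online: Sequence[float],
    next_q_target: Sequence[float],
    gamma: float = 0.99,
    done: bool = False,
) -> float:
    """Double DQN target: select the action with the online net, evaluate it with the target net."""
    if done:
        # No bootstrapping from a terminal state.
        return reward
    # Action selection uses the online network's Q-values for the next state.
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    # Action evaluation uses the target network's Q-value for that same action.
    return reward + gamma * next_q_target[best_action]


# Hypothetical Q-values for three candidate actions in the next state.
y = ddqn_target(
    reward=1.0,
    next_q_online=[0.2, 0.9, 0.4],
    next_q_target=[0.5, 0.3, 0.8],
    gamma=0.9,
)
print(y)
```

Separating selection from evaluation is what distinguishes DDQN from plain DQN, which would take the maximum of the target network's values (0.8 in this example) and tends to overestimate action values.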
H.3.8. Natural Language Processing
P. Kavehzadeh; M. M. Abdollah Pour; S. Momtazi
Abstract
Over the last few years, text chunking has played a significant part in sequence labeling tasks. Although a large variety of methods have been proposed for shallow parsing in English, most approaches proposed for text chunking in the Persian language are based on simple and traditional concepts. In this paper, we propose using state-of-the-art transformer-based contextualized models, namely BERT and XLM-RoBERTa, as the backbone of our models. A Conditional Random Field (CRF), a combination of Bidirectional Long Short-Term Memory (BiLSTM) and CRF, and a simple dense layer are each employed after the transformer-based models to enhance performance in predicting chunk labels. Moreover, we provide a new dataset for noun phrase chunking in Persian, which consists of annotated Persian news text. Our experiments reveal that XLM-RoBERTa achieves the best performance among all the architectures tried on the proposed dataset. The results also show that a single CRF layer yields better results than a dense layer and even than the combination of BiLSTM and CRF.
H.3.8. Natural Language Processing
L. Jafar Tafreshi; F. Soltanzadeh
Abstract
Named Entity Recognition is an information extraction technique that identifies named entities in a text. Three methods have conventionally been used to extract named entities from a text: rule-based, machine-learning-based, and hybrids of the two. Machine-learning-based methods perform well in the Persian language if they are trained with good features. To achieve good performance in Conditional Random Field-based Persian Named Entity Recognition, several syntactic features based on dependency grammar, along with some morphological and language-independent features, have been designed for the learning phase. In this implementation, the designed features have been applied to a Conditional Random Field to build our model. To evaluate our system, the Persian syntactic dependency Treebank with about 30,000 sentences, prepared at the NOOR Islamic science computer research center, has been used. This Treebank has named-entity tags such as Person, Organization, and Location. The results of this study show that our approach achieved 86.86% precision, 80.29% recall, and 83.44% F-measure, which are relatively higher than the values reported for other Persian NER methods.
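The reported F-measure can be checked against the reported precision and recall. The short sketch below assumes the balanced F1 definition (the harmonic mean of precision and recall); the function name is illustrative. The result agrees with the reported 83.44% up to rounding of the published figures.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta = 1 gives the harmonic mean of precision and recall."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as reported in the abstract, in percent.
f1 = f_measure(86.86, 80.29)
print(f"F1 = {f1:.2f}%")
```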
H.3.8. Natural Language Processing
S. Lazemi; H. Ebrahimpour-komleh
Abstract
A dependency parser is one of the most important fundamental tools in natural language processing; it extracts the structure of sentences and determines the relations between words based on dependency grammar. Dependency parsing is well suited to free-word-order languages such as Persian. In this paper, a data-driven dependency parser has been developed for Persian with the help of a phrase-structure parser. The feature space defined in a parser is one of the important factors in its success. Our goal is to generate and extract appropriate features for dependency parsing of Persian sentences. To achieve this goal, new semantic and syntactic features have been defined and added to the MSTParser by a stacking method. Semantic features are obtained using word clustering algorithms based on syntagmatic analysis, and syntactic features are obtained using the Persian phrase-structure parser and are encoded as bit strings. Experiments have been carried out on the Persian Dependency Treebank (PerDT) and the Uppsala Persian Dependency Treebank (UPDT). The results indicate that the new features improve the performance of the dependency parser for Persian. The achieved unlabeled attachment scores for PerDT and UPDT are 89.17% and 88.96%, respectively.
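For readers unfamiliar with the metric, the unlabeled attachment score is the percentage of tokens whose predicted head matches the gold head, ignoring the relation label. The sketch below is a minimal illustration under that standard definition; the function name and the five-token example are hypothetical and are not taken from PerDT or UPDT.

```python
from typing import Sequence


def unlabeled_attachment_score(gold_heads: Sequence[int], pred_heads: Sequence[int]) -> float:
    """Percentage of tokens whose predicted head index equals the gold head index."""
    if len(gold_heads) != len(pred_heads):
        raise ValueError("gold and predicted head sequences must have the same length")
    if not gold_heads:
        raise ValueError("at least one token is required")
    correct = sum(g == p for g, p in zip(gold_heads, pred_heads))
    return 100.0 * correct / len(gold_heads)


# Hypothetical sentence of five tokens; head index 0 marks the root.
gold = [2, 0, 2, 5, 3]
pred = [2, 0, 2, 3, 3]
print(unlabeled_attachment_score(gold, pred))
```

In this example four of the five heads are correct, so the score is 80%.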
H.3.8. Natural Language Processing
A. Akkasi; E. Varoglu
Abstract
Chemical Named Entity Recognition (NER) is the basic step for subsequent information extraction tasks such as named entity resolution, drug-drug interaction discovery, and extraction of the names of molecules and their properties. Improvement in the performance of such systems may affect the quality of the subsequent tasks. Chemical text from which data for named entity recognition is extracted is naturally imbalanced, since chemical entities are fewer than the other segments of text. In this paper, the class imbalance problem in the context of chemical named entity recognition is studied, and an adapted version of random undersampling for NER data is leveraged to generate a pool of classifiers. In order to keep the class distribution balanced within each sentence, the well-known random undersampling method is modified into a sentence-based version in which the random removal of samples takes place within each sentence instead of over the dataset as a whole. Furthermore, to take advantage of combining a set of diverse predictors, an ensemble of classifiers is created, each trained on a different training set produced by sentence-based undersampling. The proposed approach is developed and tested using the ChemDNER corpus released by BioCreative IV. Results show that the proposed method improves the classification performance of the baseline classifiers, mainly as a result of an increase in recall. Furthermore, the combination of high-performing classifiers trained on undersampled training data surpasses the performance of all single best classifiers and of the combination of classifiers trained on the full data.
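The sentence-based undersampling described above might be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes token-level labels with "O" for non-entity tokens, keeps every entity token, randomly keeps a matching number of non-entity tokens within each sentence, and leaves entity-free sentences unchanged. The function names, the `ratio` parameter, and the example sentence are all hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple

Sentence = Tuple[List[str], List[str]]  # (tokens, labels)


def undersample_sentence(
    tokens: Sequence[str],
    labels: Sequence[str],
    rng: random.Random,
    ratio: float = 1.0,
    negative_label: str = "O",
) -> Sentence:
    """Randomly drop non-entity tokens within one sentence, keeping all entity tokens."""
    pos_idx = [i for i, lab in enumerate(labels) if lab != negative_label]
    neg_idx = [i for i, lab in enumerate(labels) if lab == negative_label]
    if not pos_idx:
        # Assumption: sentences without entities are left unchanged.
        return list(tokens), list(labels)
    n_keep = min(len(neg_idx), math.ceil(ratio * len(pos_idx)))
    keep = sorted(pos_idx + rng.sample(neg_idx, n_keep))  # preserve token order
    return [tokens[i] for i in keep], [labels[i] for i in keep]


def build_training_pool(
    sentences: Sequence[Sentence], n_sets: int, seed: int = 0, ratio: float = 1.0
) -> List[List[Sentence]]:
    """Create n_sets differently undersampled copies of the data, one per ensemble member."""
    pool = []
    for s in range(n_sets):
        rng = random.Random(seed + s)  # a different random draw for each training set
        pool.append([undersample_sentence(t, l, rng, ratio) for t, l in sentences])
    return pool


data = [(
    ["Aspirin", "inhibits", "the", "enzyme", "COX-1", "in", "platelets"],
    ["B", "O", "O", "O", "B", "O", "O"],
)]
pool = build_training_pool(data, n_sets=3)
print(pool[0][0])
```

Each training set in the pool could then be used to train one classifier, with the predictions combined by voting or a similar ensemble rule.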
H.3.8. Natural Language Processing
B. Bokharaeian; A. Diaz
Abstract
Extracting biomedical relations such as drug-drug interactions (DDIs) from text is an important task in biomedical NLP. Because of the large number of complex sentences in the biomedical literature, researchers have employed sentence simplification techniques to improve the performance of relation extraction methods. However, owing to the difficulty of the task, no noteworthy improvement has been reported in the research literature. This paper explores clause-dependency-related features alongside linguistic-based negation scope and cues to overcome the complexity of the sentences. The results show that employing the proposed features combined with a bag-of-words kernel improves the performance of the kernel methods used. Moreover, experiments show that the enhanced local context kernel outperforms the other methods. The proposed method can be used in the biomedical domain as an alternative to sentence simplification, which is an error-prone task.
H.3.8. Natural Language Processing
A. Pakzad; B. Minaei Bidgoli
Abstract
Dependency parsing is a form of syntactic parsing of natural language that automatically analyzes the dependency structure of sentences and produces a dependency graph for each input sentence. Part-Of-Speech (POS) tagging is a prerequisite for dependency parsing. Generally, dependency parsers perform POS tagging and dependency parsing in a pipeline. Unfortunately, in pipeline models a tagging error propagates to the parser, and the tagger cannot make use of useful syntactic information. The goal of joint models is to reduce the errors of POS tagging and dependency parsing simultaneously. In this research, we applied a joint model to Persian and English using the Corbit software. We optimized the model's features and improved its accuracy concurrently. Corbit is an implementation of a transition-based approach to word segmentation, POS tagging, and dependency parsing. In this research, the joint accuracy of POS tagging and dependency parsing on the Persian test data reached 85.59% for coarse-grained and 84.24% for fine-grained POS. We also attained 76.01% for coarse-grained and 74.34% for fine-grained POS on English.
H.3.8. Natural Language Processing
A. Khazaei; M. Ghasemzadeh
Abstract
This paper compares clusters of aligned Persian and English texts obtained with the k-means method. Text clustering has many applications in various fields of natural language processing. So far, much research has been carried out on clustering English documents. The question arises whether its results extend to other languages. Since the goal of document clustering is to group documents based on their content, the answer is expected to be yes. On the other hand, the many differences between languages could make the answer no. This research focuses on k-means, which is one of the basic and popular document clustering methods. We want to know whether the clusters of aligned Persian and English texts obtained by k-means are similar. To answer this question, the Mizan English-Persian Parallel Corpus was used as a benchmark. After feature extraction using text mining techniques and dimensionality reduction with PCA, k-means clustering was performed. The morphological differences between English and Persian led to a longer feature vector for Persian, so in almost all experiments the English results were slightly richer than the Persian ones. Aside from these differences, the overall behavior of the Persian and English clusters was similar. This similar behavior shows that the results of k-means research on English can be extended to Persian. Finally, there is hope that, despite the many differences between languages, clustering methods may be extendable to other languages.
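One simple way to quantify whether two clusterings of aligned documents are similar is the Rand index: the fraction of document pairs that both clusterings treat the same way (together in both, or apart in both). The abstract does not state which similarity measure was used, so the sketch below is only an illustrative stand-in; the function name and the cluster assignments are hypothetical.

```python
from itertools import combinations
from typing import Sequence


def rand_index(labels_a: Sequence[int], labels_b: Sequence[int]) -> float:
    """Fraction of item pairs on which two clusterings agree (same-cluster vs different-cluster)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must cover the same aligned documents")
    n = len(labels_a)
    if n < 2:
        raise ValueError("at least two documents are required")
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in combinations(range(n), 2)
    )
    return agree / (n * (n - 1) // 2)


# Hypothetical cluster ids for five aligned documents (position i is the same document pair).
persian_clusters = [0, 0, 1, 1, 2]
english_clusters = [1, 1, 0, 0, 0]
print(rand_index(persian_clusters, english_clusters))
```

The measure compares pairwise groupings rather than cluster ids, so it does not matter that k-means numbers the clusters differently in the two languages.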