H.3.8. Natural Language Processing
Hassan Deldar; Mohammad Mehdi Homayounpour
Abstract
In most countries, the legislative process has a long history, which has led to increasing diversity and multiplicity of laws. This has made it difficult to access laws that are valid in both time and place. The focus of this article is on the application of artificial intelligence in the domain of legal statutes to assist in identifying the need for amendments to laws or specific provisions. The general framework of the proposed process consists of two key components. First, the texts of legal clauses or articles are enriched using large language models, which involves producing embedding vectors, thematic classification, and extracting the provisions of each law. Second, a retrieval-augmented text generation (RAG) system is developed with the aid of large language models to determine conflicts or the need for expurgation in the output, utilizing the enriched data, predefined prompts, and the Chain of Thought (CoT) technique. The proposed method was evaluated on two benchmark datasets. On the COLIEE 2025 dataset, our approach outperformed the 2024 winners in legal implication tasks, achieving an F1 score of 0.6521 with minimal prompting. The second evaluation used over 1,000 legal clauses covering abrogation and neutral rules, yielding an F1 score exceeding 73.41%. The findings demonstrate that, even with limited expertise in the legal domain, it is possible to identify conflicts and the need to refine legal texts to a degree acceptable to legal experts within a reasonable timeframe, leveraging the capabilities of large language models.
H.3.8. Natural Language Processing
Saedeh Tahery; Saeed Farzi
Abstract
Dialogue understanding for low-resource languages like Persian remains challenging due to limited annotated data, which constrains supervised training at scale. We propose a simple yet effective training-free method that combines machine translation, retrieval-based example selection, and prompting with a large language model (GPT-4o) to improve zero-shot cross-lingual performance. Given a Persian utterance translated into English, our method retrieves semantically and lexically similar English examples using a hybrid similarity function, translates them back into Persian, and constructs a few-shot prompt tailored to the input. This input-sensitive strategy enhances the quality of the examples, helping the model align more effectively with each instance. Experimental results on the Persian-ATIS dataset show that our approach improves intent detection and achieves competitive slot filling performance, outperforming state-of-the-art baselines without requiring any supervision in the target language. The modular pipeline is easy to reproduce and, in future work, can be extended to other low-resource languages, tasks, or retrieval configurations. The repository of our work is available at https://anonymous.4open.science/r/Persian_Language_Understanding-FDF4.
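The retrieval step described in this abstract can be illustrated with a minimal sketch. The abstract does not define the hybrid similarity function, so the form used here (a weighted sum of cosine similarity over embedding vectors and Jaccard overlap over tokens), the weight `alpha`, the helper names, and the toy vectors are all assumptions made for illustration, not the authors' implementation.

```python
import math

def jaccard(a_tokens, b_tokens):
    """Lexical overlap between two token collections (0.0 to 1.0)."""
    a, b = set(a_tokens), set(b_tokens)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def hybrid_score(query, example, alpha=0.5):
    """Assumed hybrid: alpha * semantic + (1 - alpha) * lexical."""
    semantic = cosine(query["vector"], example["vector"])
    lexical = jaccard(query["text"].lower().split(),
                      example["text"].lower().split())
    return alpha * semantic + (1.0 - alpha) * lexical

def select_examples(query, pool, k=2, alpha=0.5):
    """Return the k pool examples with the highest hybrid score."""
    ranked = sorted(pool, key=lambda ex: hybrid_score(query, ex, alpha),
                    reverse=True)
    return ranked[:k]

# Toy data: the 3-dimensional vectors stand in for real sentence embeddings.
query = {"text": "show flights from boston to denver", "vector": [0.9, 0.1, 0.0]}
pool = [
    {"text": "list flights from boston to atlanta", "vector": [0.8, 0.2, 0.1]},
    {"text": "what is the cheapest fare", "vector": [0.1, 0.9, 0.2]},
    {"text": "show ground transportation in denver", "vector": [0.3, 0.2, 0.9]},
]
top = select_examples(query, pool, k=2)
for example in top:
    print(example["text"])
```

In the pipeline the abstract describes, the selected English examples would then be translated back into Persian and placed in the few-shot prompt; that step is omitted here.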
H.3.8. Natural Language Processing
Arash Keshtkar; Saeedeh Sadat Sadidpour; Hossien Shirazi
Abstract
Word Sense Disambiguation (WSD) is a longstanding challenge in natural language processing, particularly in morphologically rich and low-resource languages such as Persian. The inherent ambiguity of Persian named entities, exacerbated by domain-specific contexts and limited labeled data, complicates both semantic interpretation and information extraction. In this study, we introduce the PWNC corpus, a large-scale, integrated dataset designed for both Named Entity Recognition (NER) and WSD in Persian. The corpus was automatically constructed through a semi-supervised framework, incorporating contextual similarity measures and clustering algorithms to annotate ambiguous entities across ten semantic categories. The proposed homograph semantic categorization method achieved robust performance, with a precision of 83%, recall of 81%, and an F1-score of 82% across over 305K annotated paragraphs. Detailed error analysis revealed challenges in disambiguating closely related senses and weak entities, which were mitigated through contextual embedding strategies. This work provides the first publicly available dual-task corpus for Persian NER and WSD, offering a scalable solution for disambiguation in low-resource tasks and establishing a baseline for future research in Persian semantic processing.
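As a quick consistency check on the figures reported above, the F1-score is the harmonic mean of precision and recall. The short sketch below recomputes it from the stated 83% precision and 81% recall; the function name is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Values taken from the abstract, expressed as fractions.
reported_precision = 0.83
reported_recall = 0.81

f1 = f1_score(reported_precision, reported_recall)
print(f"F1 = {f1:.4f}")  # rounds to the reported 82%
```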
H.3.8. Natural Language Processing
Mohammad Hadi Goldani; Saeedeh Momtazi; Reza Safabakhsh
Abstract
The widespread use of web-based forums and social media has led to an increase in news consumption. To mitigate the impact of misinformation on users' health-related decisions, it is crucial to develop machine learning models that can automatically detect and combat fake news. In this paper, we propose a novel multilingual model with a dynamic transformer, called Hybrid CapsNet, for Covid-19 fake news detection in English and Persian. Our model incorporates two dynamic pre-trained representation models that incrementally uptrain and update the word embeddings in the training phase, dynamic RoBERTa for English and dynamic ParsBERT for Persian, and two parallel classifiers with a new loss function, namely margin loss. By utilizing dynamic transformers together with both Deep Convolutional Neural Networks (DCNN) and Capsule Neural Networks (CapsNet), we achieve better performance than state-of-the-art baselines. To evaluate the proposed model, we use two recent Covid-19 datasets in English and Persian. Our results, in terms of F1-score, demonstrate the effectiveness of the Hybrid CapsNet model. Our model outperforms existing baselines, suggesting that it can be an effective tool for detecting and combating fake news related to Covid-19 in multiple languages. Overall, our study highlights the importance of developing effective machine learning models for combating misinformation during critical events such as the Covid-19 pandemic. The proposed model has the potential to be applied to other languages and domains and can be a valuable tool for protecting public health and safety.
H.3.8. Natural Language Processing
Mozhgan Akaberi; Maryam Khodabakhsh; Seyedehfatemeh Karimi; Hoda Mashayekhi
Abstract
The exponential growth of digital information has increased the demand for robust and efficient Information Retrieval (IR) systems. Query Performance Prediction (QPP) is a critical task for identifying difficult queries and enhancing retrieval strategies. However, existing QPP methods suffer from several limitations: (1) score-based approaches fail to capture the structural relationships among retrieved documents, (2) supervised methods require labeled training data, making them costly and impractical for new domains, and (3) unsupervised post-retrieval predictors often rely solely on retrieval score dispersion, neglecting document clustering effects. To address these challenges, we propose a novel clustering-based post-retrieval QPP method. Specifically, we introduce three unsupervised predictors: Clustered Distinction, which measures query-specific separability of retrieved clusters; Clustered Query Drift, which estimates the deviation of top-ranked documents from query intent; and a hybrid approach combining both. By analyzing the clustering structure of retrieved documents, our method improves interpretability while eliminating the need for labeled data. We evaluate our approach on three standard datasets: the large-scale MS MARCO Passage Ranking dataset, TREC DL 2019, and TREC DL 2020. Experimental results demonstrate that our method significantly outperforms state-of-the-art score-based QPP models. These findings highlight the potential of cluster-aware QPP for enhancing IR systems and reducing the impact of difficult queries.
H.3.8. Natural Language Processing
Milad Allahgholi; Hossein Rahmani; Parinaz Soltanzadeh
Abstract
Stance detection is the process of identifying and classifying an author's point of view or stance towards a specific target in a given text. Most previous studies on stance detection neglect the contextual information hidden in the input data and, as a result, produce less accurate results. In this paper, we propose a novel method called ConSPro, which uses decoder-only transformers to consider contextual input data in the process of stance detection. First, ConSPro applies zero-shot prompting of decoder-only transformers to extract the context of the target in the input data. Second, in addition to the target and input text, ConSPro uses the extracted context as the third type of parameter for the ensemble method. We evaluate ConSPro on SemEval2016, and the empirical results indicate that ConSPro outperforms non-contextual approaches by 9% on average with respect to F-measure. The findings of this study show the strong capabilities of zero-shot prompting for extracting informative contextual information with significantly less effort compared to previous methods of context extraction.
H.3.8. Natural Language Processing
Milad Allhgholi; Hossein Rahmani; Amirhossein Derakhshan; Saman Mohammadi Raouf
Abstract
Document similarity matching is essential for efficient text retrieval, plagiarism detection, and content analysis. Existing studies in this field can be categorized into three approaches: statistical analysis, deep learning, and hybrid approaches. However, to the best of our knowledge, none have incorporated the importance of named entities into their methodologies. In this paper, we propose DOSTE, a method that first extracts named entities and then utilizes them to enhance document similarity matching through statistical and graph-based analysis. Empirical results indicate that DOSTE achieves better results by emphasizing named entities, resulting in an average improvement of 9% in the average recall metric compared to baseline methods. Moreover, unlike LLM-based approaches, DOSTE does not require extensive GPU resources. Additionally, non-empirical interpretations of the results indicate that DOSTE is particularly effective in identifying similarity in short documents and complex document comparisons.
H.3.8. Natural Language Processing
Ali Reza Ghasemi; Javad Salimi Sartakhti
Abstract
This paper evaluates the performance of various fine-tuning methods in Persian natural language processing (NLP) tasks. In low-resource languages like Persian, which suffer from a lack of rich and sufficient data for training large models, it is crucial to select appropriate fine-tuning techniques that mitigate overfitting and prevent the model from learning weak or surface-level patterns. The main goal of this research is to compare the effectiveness of fine-tuning approaches such as Full-Finetune, LoRA, AdaLoRA, and DoRA on model learning and task performance. We apply these techniques to three different Persian NLP tasks: sentiment analysis, named entity recognition (NER), and span question answering (QA). For this purpose, we conduct experiments on three Transformer-based multilingual models with different architectures and parameter scales: BERT-base multilingual (~168M parameters) with an encoder-only structure, mT5-small (~300M parameters) with an encoder-decoder structure, and mGPT (~1.4B parameters) with a decoder-only structure. Each of these models supports the Persian language but varies in structure and computational requirements, influencing the effectiveness of different fine-tuning approaches. Results indicate that fully fine-tuned BERT-base multilingual consistently outperforms other models across all tasks in basic metrics, particularly given the unique challenges of these embedding-based tasks. Additionally, lightweight fine-tuning methods like LoRA and DoRA offer very competitive performance while significantly reducing computational overhead, and they outperform the other methods on the Performance-Efficiency Score introduced in the paper. This study contributes to a better understanding of fine-tuning methods, especially for Persian NLP, and offers practical guidance for applying Large Language Models (LLMs) to downstream tasks in low-resource languages.
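To make the "reduced computational overhead" of LoRA concrete, the sketch below counts trainable parameters for a single weight matrix. LoRA freezes the original d_out x d_in weight and trains two low-rank factors, B (d_out x r) and A (r x d_in). The hidden size of 768 matches BERT-base; the rank of 8 is an illustrative assumption, since the abstract does not state the ranks used.

```python
def full_finetune_params(d_out, d_in):
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in

def lora_params(d_out, d_in, rank):
    """Trainable parameters for the low-rank factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)

hidden_size = 768   # BERT-base hidden dimension
rank = 8            # assumed LoRA rank, for illustration only

full = full_finetune_params(hidden_size, hidden_size)
lora = lora_params(hidden_size, hidden_size, rank)
ratio = lora / full

print(f"full fine-tuning: {full} parameters per matrix")
print(f"LoRA (r={rank}): {lora} parameters per matrix")
print(f"LoRA trains {ratio:.2%} of the matrix's parameters")
```

The same counting applies to every matrix LoRA is attached to, so the saving scales with the number of adapted layers.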
H.3.8. Natural Language Processing
Nura Esfandiari; Kourosh Kiani; Razieh Rastgoo
Abstract
Chatbots are computer programs designed to simulate human conversation. Powered by artificial intelligence (AI), particularly large language models (LLMs), chatbots are increasingly used to provide customer service. A process known as fine-tuning is employed to personalize the answers of LLM-based chatbots. This process demands substantial high-quality data and computational resources. In this article, to overcome the computational hurdles associated with fine-tuning LLMs, an innovative hybrid approach is proposed. This approach aims to enhance the answers generated by LLMs, specifically for Persian chatbots used in mobile customer services. A transformer-based evaluation model was developed to score generated answers and select the most appropriate ones. Additionally, a Persian language dataset tailored to the domain of mobile sales was collected to support the personalization of the Persian chatbot and the training of the evaluation model. This approach is expected to foster increased customer interaction and boost sales within the Persian mobile phone market. Experiments conducted on four different LLMs demonstrated the effectiveness of the proposed approach in generating more relevant and semantically accurate answers for users.
H.3.8. Natural Language Processing
Alireza Mohammadi Gohar; Kambiz Rahbar; Behrouz Minaei-Bidgoli; Ziaeddin Beheshtifard
Abstract
Generative Adversarial Networks (GANs) have emerged as a pivotal research focus within artificial intelligence due to their exceptional capabilities in data generation. Their ability to produce high-quality synthetic data has garnered significant attention, leading to their application in diverse domains such as image and video generation, classification, and style transfer. Beyond these continuous data applications, GANs are also being leveraged for discrete data tasks, including text and music generation. The distinct nature of continuous and discrete data poses unique challenges for GANs. In particular, generating discrete values necessitates the use of Policy Gradient algorithms from reinforcement learning to avoid the direct back-propagation typically used for continuous values. The generator must map latent variables into discrete domains, and unlike continuous value generation, this process involves subtle adjustments to the generator’s outputs to progressively align with real discrete data, guided by the discriminator. This paper aims to provide a thorough review of GAN architectures, fundamental concepts, and applications in the context of discrete data. Additionally, it addresses the existing challenges, evaluation metrics, and future research directions in this burgeoning field.
H.3.8. Natural Language Processing
Davud Mohammadpur; Mehdi Nazari
Abstract
Text summarization has become a popular research subject due to the rapid growth of content. In title generation, a key aspect of text summarization, creating a concise and meaningful title is essential, as it reflects the article's content, objectives, methodologies, and findings. Thus, generating an effective title requires a thorough understanding of the article. Various methods have been proposed in text summarization to automatically generate titles, utilizing machine learning and deep learning techniques to improve results. This study aims to develop a title generation system for scientific articles using transformer-based methods to create suitable titles from article abstracts. Pre-trained transformer-based models like BERT, T5, and PEGASUS are optimized for constructing complete sentences, but their ability to generate scientific titles is limited. We address this limitation with a method that combines different models along with a suitable dataset for training. To create this dataset, we collected abstracts and titles of articles published on the ScienceDirect.com website. After preprocessing this data, we obtained a dataset consisting of 50,000 articles. The evaluation results of the proposed method indicate more than 20% improvement on various ROUGE metrics in the generation of scientific titles. Additionally, an examination of the results by experts in each scientific field revealed that the generated titles are also acceptable to these specialists.
H.3.8. Natural Language Processing
Nura Esfandiari; Kourosh Kiani; Razieh Rastgoo
Abstract
A chatbot is a computer program designed to simulate human-like conversations and interact with users. It is a form of conversational agent that utilizes Natural Language Processing (NLP) and sequential models to understand user input, interpret the user's intent, and generate an appropriate answer. This approach aims to generate word sequences in the form of coherent phrases. A notable challenge associated with previous models lies in their sequential training process, which can result in less accurate outcomes. To address this limitation, a novel generative chatbot is proposed, integrating the power of Reinforcement Learning (RL) and transformer models. The proposed chatbot aims to overcome the challenges associated with sequential training by combining these two approaches. The proposed approach employs a Double Deep Q-Network (DDQN) architecture that uses a transformer model as the agent. This agent takes the human question as an input state and generates the bot answer as an action. To the best of our knowledge, this is the first time that a generative chatbot has been proposed using a DDQN architecture with an embedded transformer as the agent. Results on two public datasets, Daily Dialog and Chit-Chat, evaluated with various metrics, validate the superiority of the proposed approach over state-of-the-art models.
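The core idea of a Double Deep Q-Network can be sketched in a few lines: the online network chooses the best next action, and a separate target network evaluates it, which reduces the overestimation bias of standard Q-learning. The sketch below shows only this generic target computation over plain lists of Q-values; how the paper maps questions and generated answers onto states and actions is not given in the abstract, so the discrete action values and function name here are assumptions.

```python
def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Double DQN bootstrap target for one transition.

    next_q_online: Q-values of the online network for the next state.
    next_q_target: Q-values of the target network for the next state.
    """
    if done:
        # No bootstrapping from a terminal state.
        return reward
    # Action selection uses the online network...
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    # ...while action evaluation uses the target network.
    return reward + gamma * next_q_target[best_action]

# Toy transition with three candidate actions (hypothetical values).
target = double_dqn_target(
    reward=1.0,
    next_q_online=[0.2, 0.9, 0.4],
    next_q_target=[0.5, 0.3, 0.8],
    gamma=0.9,
)
print(f"{target:.2f}")
```

Note that the target uses the target network's value for the action picked by the online network (index 1 here), not the target network's own maximum.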
H.3.8. Natural Language Processing
P. Kavehzadeh; M. M. Abdollah Pour; S. Momtazi
Abstract
Over the last few years, text chunking has played a significant part in sequence labeling tasks. Although a large variety of methods have been proposed for shallow parsing in English, most proposed approaches for text chunking in Persian are based on simple and traditional concepts. In this paper, we propose using state-of-the-art transformer-based contextualized models, namely BERT and XLM-RoBERTa, as the backbone of our models. A Conditional Random Field (CRF), the combination of Bidirectional Long Short-Term Memory (BiLSTM) and CRF, and a simple dense layer are employed after the transformer-based models to enhance the model's performance in predicting chunk labels. Moreover, we provide a new dataset for noun phrase chunking in Persian, which includes annotated data from Persian news text. Our experiments reveal that XLM-RoBERTa achieves the best performance among all the architectures tried on the proposed dataset. The results also show that using a single CRF layer yields better results than a dense layer and even the combination of BiLSTM and CRF.
H.3.8. Natural Language Processing
L. Jafar Tafreshi; F. Soltanzadeh
Abstract
Named Entity Recognition is an information extraction technique that identifies named entities in a text. Three approaches have conventionally been used to extract named entities from a text: rule-based, machine-learning-based, and hybrid methods. Machine-learning-based methods perform well in the Persian language if they are trained with good features. To achieve good performance in Conditional Random Field-based Persian Named Entity Recognition, several syntactic features based on dependency grammar, along with some morphological and language-independent features, have been designed to provide suitable features for the learning phase. In this implementation, the designed features have been applied to a Conditional Random Field to build our model. To evaluate our system, the Persian syntactic dependency Treebank with about 30,000 sentences, prepared at the NOOR Islamic science computer research center, has been used. This Treebank has Named-Entity tags, such as Person, Organization and Location. The results of this study showed that our approach achieved 86.86% precision, 80.29% recall and 83.44% F-measure, which are relatively higher than the values reported for other Persian NER methods.
H.3.8. Natural Language Processing
S. Lazemi; H. Ebrahimpour-komleh
Abstract
The dependency parser is one of the most important fundamental tools in natural language processing; it extracts the structure of sentences and determines the relations between words based on dependency grammar. Dependency parsing is well suited to free-word-order languages, such as Persian. In this paper, a data-driven dependency parser has been developed for Persian with the help of a phrase-structure parser. The feature space defined in each parser is one of the important factors in its success. Our goal is to generate and extract appropriate features for dependency parsing of Persian sentences. To achieve this goal, new semantic and syntactic features have been defined and added to the MSTParser by the stacking method. Semantic features are obtained by using word clustering algorithms based on syntagmatic analysis, and syntactic features are obtained by using the Persian phrase-structure parser and are encoded as bit-strings. Experiments have been done on the Persian Dependency Treebank (PerDT) and the Uppsala Persian Dependency Treebank (UPDT). The results indicate that the definition of new features improves the performance of the dependency parser for Persian. The achieved unlabeled attachment scores for PerDT and UPDT are 89.17% and 88.96%, respectively.
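The unlabeled attachment score (UAS) reported above is the fraction of tokens whose predicted head matches the gold head, ignoring the dependency label. A minimal sketch of the metric follows; the head-index encoding (one list of head positions per sentence, with 0 for the root) and the toy sentences are assumptions for illustration, and treebank-specific conventions such as punctuation handling are not modeled.

```python
def unlabeled_attachment_score(gold_heads, predicted_heads):
    """Fraction of tokens whose predicted head equals the gold head.

    Each argument is a list of sentences; each sentence is a list of
    head indices, one per token (0 denotes the artificial root).
    """
    total = 0
    correct = 0
    for gold, pred in zip(gold_heads, predicted_heads):
        if len(gold) != len(pred):
            raise ValueError("gold and predicted sentences differ in length")
        total += len(gold)
        correct += sum(1 for g, p in zip(gold, pred) if g == p)
    return correct / total if total else 0.0

# Two toy sentences with 3 and 4 tokens; one head is predicted wrongly.
gold = [[2, 0, 2], [0, 1, 2, 1]]
pred = [[2, 0, 1], [0, 1, 2, 1]]

uas = unlabeled_attachment_score(gold, pred)
print(f"UAS = {uas:.2%}")
```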
H.3.8. Natural Language Processing
A. Akkasi; E. Varoglu
Abstract
Chemical Named Entity Recognition (NER) is the basic step for subsequent information extraction tasks such as named entity resolution, drug-drug interaction discovery, and extraction of the names of molecules and their properties. Improvement in the performance of such systems may affect the quality of the subsequent tasks. Chemical text from which data for named entity recognition is extracted is naturally imbalanced, since chemical entities are fewer than other segments in the text. In this paper, the class imbalance problem in the context of chemical named entity recognition has been studied, and an adapted version of random undersampling for NER data has been leveraged to generate a pool of classifiers. In order to keep the class distribution balanced within each sentence, the well-known random undersampling method is modified to a sentence-based version in which the random removal of samples takes place within each sentence instead of considering the dataset as a whole. Furthermore, to take advantage of combining a set of diverse predictors, an ensemble of classifiers trained on the different training sets produced by sentence-based undersampling is created. The proposed approach is developed and tested using the ChemDNER corpus released by BioCreative IV. Results show that the proposed method improves the classification performance of the baseline classifiers, mainly as a result of an increase in recall. Furthermore, the combination of high-performing classifiers trained using undersampled training data surpasses the performance of all single best classifiers and the combination of classifiers using full data.
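The sentence-based undersampling idea can be sketched as follows: within each sentence, all entity tokens are kept and non-entity tokens are randomly removed until a chosen majority-to-minority ratio is reached; repeating this with different random seeds yields the diverse training sets for the ensemble. The abstract does not give the exact ratio, tag scheme, or the handling of sentences without entities, so the `ratio` parameter, the "O" majority label, and the rule of keeping one token in entity-free sentences are assumptions.

```python
import random

def undersample_sentence(sentence, rng, ratio=1.0, majority_label="O"):
    """Randomly drop majority-class tokens inside one sentence.

    sentence: list of (token, label) pairs.
    ratio: assumed target number of majority tokens per minority token.
    """
    majority_idx = [i for i, (_, label) in enumerate(sentence)
                    if label == majority_label]
    n_minority = len(sentence) - len(majority_idx)
    # Assumption: keep at least one majority token when any are present.
    n_keep = min(len(majority_idx), max(1, round(ratio * n_minority)))
    kept = set(rng.sample(majority_idx, n_keep))
    # Preserve the original token order.
    return [pair for i, pair in enumerate(sentence)
            if pair[1] != majority_label or i in kept]

def build_training_pool(sentences, n_sets=3, ratio=1.0, seed=0):
    """Create several undersampled copies of the data, one per seed."""
    pool = []
    for k in range(n_sets):
        rng = random.Random(seed + k)
        pool.append([undersample_sentence(s, rng, ratio) for s in sentences])
    return pool

# Toy data: "B" marks a chemical entity token, "O" everything else.
sentences = [
    [("Aspirin", "B"), ("inhibits", "O"), ("the", "O"),
     ("enzyme", "O"), ("COX-1", "B"), (".", "O")],
    [("No", "O"), ("entities", "O"), ("here", "O")],
]
pool = build_training_pool(sentences, n_sets=3)
for training_set in pool:
    print(training_set[0])
```

Each set in `pool` would train one classifier, and the ensemble combines their predictions; the classifier and combination rule are outside the scope of this sketch.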
H.3.8. Natural Language Processing
B. Bokharaeian; A. Diaz
Abstract
Extracting biomedical relations such as drug-drug interaction (DDI) from text is an important task in biomedical NLP. Due to the large number of complex sentences in biomedical literature, researchers have employed sentence simplification techniques to improve the performance of relation extraction methods. However, due to the difficulty of the task, no noteworthy improvement has been reported in the research literature. This paper explores clause-dependency-related features alongside linguistic-based negation scope and cues to overcome the complexity of the sentences. The results show that employing the proposed features combined with a bag-of-words kernel improves the performance of the kernel methods used. Moreover, experiments show that the enhanced local context kernel outperforms other methods. The proposed method can be used as an alternative to sentence simplification techniques in the biomedical area, which are error-prone.
H.3.8. Natural Language Processing
A. Pakzad; B. Minaei Bidgoli
Abstract
Dependency parsing is a form of syntactic parsing in natural language processing that automatically analyzes the dependency structure of sentences and creates a dependency graph for each input sentence. Part-Of-Speech (POS) tagging is a prerequisite for dependency parsing. Generally, dependency parsers perform POS tagging and dependency parsing in a pipeline. Unfortunately, in pipeline models, tagging errors propagate to the parser, and the tagger is not able to use helpful syntactic information. The goal of joint models is to simultaneously reduce the errors of the POS tagging and dependency parsing tasks. In this research, we applied a joint model to Persian and English using the Corbit software. We optimized the model's features and improved its accuracy concurrently. Corbit is an implementation of a transition-based approach for word segmentation, POS tagging and dependency parsing. In this research, the joint accuracy of POS tagging and dependency parsing on the Persian test data reached 85.59% for coarse-grained and 84.24% for fine-grained POS. We also attained 76.01% for coarse-grained and 74.34% for fine-grained POS on English.
H.3.8. Natural Language Processing
A. Khazaei; M. Ghasemzadeh
Abstract
This paper compares clusters of aligned Persian and English texts obtained with the k-means method. Text clustering has many applications in various fields of natural language processing. So far, much research has been carried out on clustering English documents. This raises the question of whether its results extend to other languages. Since the goal of document clustering is to group documents based on their content, the answer is expected to be yes. On the other hand, the many differences between languages could make the answer no. This research focuses on k-means, one of the basic and popular document clustering methods. We want to know whether the clusters of aligned Persian and English texts obtained by k-means are similar. To answer this question, the Mizan English-Persian Parallel Corpus was used as a benchmark. After feature extraction using text mining techniques and application of the PCA dimension reduction method, k-means clustering was performed. The morphological differences between English and Persian resulted in a longer feature vector for Persian. As a result, in almost all experiments, the English results were slightly richer than those in Persian. Aside from these differences, the overall behavior of the Persian and English clusters was similar. This similarity suggests that the results of k-means research on English can be extended to Persian. Finally, there is hope that, despite the many differences between languages, clustering methods may be extendable to other languages.
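One way to quantify whether two clusterings of aligned documents are similar is the Rand index: the share of document pairs that are treated the same way (grouped together or kept apart) by both clusterings. The abstract does not say which comparison measure the authors used, so the choice of the Rand index and the toy cluster labels below are assumptions for illustration only.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Share of item pairs on which two clusterings agree.

    labels_a[i] and labels_b[i] are the cluster ids assigned to the same
    aligned document i; the ids themselves need not match across clusterings.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must cover the same documents")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0
    agree = sum(
        1 for i, j in pairs
        if (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
    )
    return agree / len(pairs)

# Hypothetical k-means assignments for six aligned document pairs.
english_clusters = [0, 0, 1, 1, 2, 2]
persian_clusters = [1, 1, 0, 0, 2, 0]

ri = rand_index(english_clusters, persian_clusters)
print(f"Rand index = {ri:.2f}")
```

A value of 1.0 means the two clusterings partition the aligned documents identically; lower values indicate more disagreement.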