H.3. Artificial Intelligence
Farid Ariai; Maryam Tayefeh Mahmoudi; Ali Moeini
Abstract
In the era of pervasive internet use and the dominance of social networks, researchers face significant challenges in Persian text mining, including the scarcity of adequate datasets in Persian and the inefficiency of existing language models. This paper specifically tackles these challenges, aiming ...
Read More
In the era of pervasive internet use and the dominance of social networks, researchers face significant challenges in Persian text mining, including the scarcity of adequate datasets in Persian and the inefficiency of existing language models. This paper specifically tackles these challenges, aiming to amplify the efficiency of language models tailored to the Persian language. Focusing on enhancing the effectiveness of sentiment analysis, our approach employs an aspect-based methodology utilizing the ParsBERT model, augmented with a relevant lexicon. The study centers on sentiment analysis of user opinions extracted from the Persian website 'Digikala.' The experimental results not only highlight the proposed method's superior semantic capabilities but also showcase its efficiency gains with an accuracy of 88.2% and an F1 score of 61.7. The importance of enhancing language models in this context lies in their pivotal role in extracting nuanced sentiments from user-generated content, ultimately advancing the field of sentiment analysis in Persian text mining by increasing efficiency and accuracy.
F. Jafarinejad
Abstract
In recent years, new word embedding methods have clearly improved the accuracy of NLP tasks. A review of the progress of these methods shows that the complexity of these models and the number of their training parameters grows increasingly. Therefore, there is a need for methodological innovation for ...
Read More
In recent years, new word embedding methods have clearly improved the accuracy of NLP tasks. A review of the progress of these methods shows that the complexity of these models and the number of their training parameters grows increasingly. Therefore, there is a need for methodological innovation for presenting new word embedding methodologies. Most current word embedding methods use a large corpus of unstructured data to train the semantic vectors of words. This paper addresses the basic idea of utilizing from structure of structured data to introduce embedding vectors. Therefore, the need for high processing power, large amount of processing memory, and long processing time will be met using structures and conceptual knowledge lies in them. For this purpose, a new embedding vector, Word2Node is proposed. It uses a well-known structured resource, the WordNet, as a training corpus and hypothesis that graphic structure of the WordNet includes valuable linguistic knowledge that can be considered and not ignored to provide cost-effective and small sized embedding vectors. The Node2Vec graph embedding method allows us to benefit from this powerful linguistic resource. Evaluation of this idea in two tasks of word similarity and text classification has shown that this method perform the same or better in comparison to the word embedding method embedded in it (Word2Vec). This result is achieved while the required training data is reduced by about 50,000,000%. These results provide a view of capacity of the structured data to improve the quality of existing embedding methods and the resulting vectors.
Document and Text Processing
A. Ahmadi Tameh; M. Nassiri; M. Mansoorizadeh
Abstract
WordNet is a large lexical database of English language, in which, nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets). Each synset expresses a distinct concept. Synsets are interlinked by both semantic and lexical relations. WordNet is essentially used for word ...
Read More
WordNet is a large lexical database of English language, in which, nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets). Each synset expresses a distinct concept. Synsets are interlinked by both semantic and lexical relations. WordNet is essentially used for word sense disambiguation, information retrieval, and text translation. In this paper, we propose several automatic methods to extract Information and Communication Technology (ICT)-related data from Princeton WordNet. We, then, add these extracted data to our Persian WordNet. The advantage of automated methods is reducing the interference of human factors and accelerating the development of our bilingual ICT WordNet. In our first proposed method, based on a small subset of ICT words, we use the definition of each synset to decide whether that synset is ICT. The second mechanism is to extract synsets which are in a semantic relation with ICT synsets. We also use two similarity criteria, namely LCS and S3M, to measure the similarity between a synset definition in WordNet and definition of any word in Microsoft dictionary. Our last method is to verify the coordinate of ICT synsets. Results show that our proposed mechanisms are able to extract ICT data from Princeton WordNet at a good level of accuracy.