Document Type: Applied Article


Department of Computer Engineering, University of Zanjan, Iran.


Sentiment analysis of big data from social networks can help research and business projects extract useful insights from text-oriented content. In this paper, we propose a general pre-processing framework for sentiment analysis that prepares textual data efficiently for use with FastText and Recurrent Neural Network variants. The framework consists of three stages: data cleansing, tweet padding, and extraction of word embeddings from FastText with conversion of tweets to these vectors. It is implemented using the DataFrame data structure in Apache Spark. Its main objective is to enhance the performance of online sentiment analysis in terms of pre-processing time and to handle large-scale data volumes. In addition, we propose a distributed intelligent system for online social big data analytics, designed to store, process, and classify large amounts of information online. The proposed system can adopt any word embedding library, such as FastText, together with different distributed deep learning models, such as LSTM or GRU. The evaluation results show that the proposed framework significantly improves on previous RDD-based methods in terms of processing time and data volume.
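
The three pre-processing stages named in the abstract (cleansing, padding, and conversion of tweets to embedding vectors) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' Spark implementation: the cleansing rules, the `<pad>` token, the fixed length of 4, and the `TOY_EMBEDDINGS` table (a stand-in for real FastText vectors) are all hypothetical choices made for the example.

```python
import re

PAD_TOKEN = "<pad>"  # hypothetical padding token (assumption)
EMBED_DIM = 3        # toy dimension; real FastText vectors are much larger

# Toy lookup table standing in for FastText vectors (made-up values).
TOY_EMBEDDINGS = {
    "good": [0.9, 0.1, 0.0],
    "bad": [-0.8, 0.2, 0.1],
    "movie": [0.0, 0.5, 0.5],
}
ZERO_VECTOR = [0.0] * EMBED_DIM


def cleanse(tweet):
    """Stage 1: lowercase, then drop URLs, mentions, and non-letter characters."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"@\w+", " ", text)          # remove @mentions
    text = re.sub(r"[^a-z\s]", " ", text)      # keep only letters and whitespace
    return text.split()


def pad(tokens, length):
    """Stage 2: truncate or right-pad the token list to a fixed length."""
    return tokens[:length] + [PAD_TOKEN] * max(0, length - len(tokens))


def embed(tokens):
    """Stage 3: map each token to a vector; unknown and pad tokens get zeros."""
    return [TOY_EMBEDDINGS.get(token, ZERO_VECTOR) for token in tokens]


def preprocess(tweet, length=4):
    """Run the three stages in order on a single tweet."""
    return embed(pad(cleanse(tweet), length))


if __name__ == "__main__":
    print(preprocess("Good movie! http://t.co/x @user"))
```

In a Spark setting, one plausible arrangement is to apply each stage as a column transformation over a DataFrame of tweets, so that the same per-tweet logic is distributed across partitions; the abstract does not specify how the authors organize this, so that mapping is an assumption.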

