Spam Email Classification Optimization With NLP-Based Naïve Bayes on TF-IDF and SMOTE

  • Andi Maslan, Department of Informatic Engineering, Universitas Putera Batam, Batam, Kepulauan Riau, 2943, Indonesia
  • Azan Rahman, Department of Informatic Engineering, Universitas Putera Batam, Batam, Kepulauan Riau, 2943, Indonesia
  • Umar Faruq, School of Information Technology, UNITAR International University, Petaling Jaya, Selangor, Malaysia
  • Rabei Raad Ali Al-Jawr, Department of Computer Engineering Technology, Northern Technical University, Mosul, Iraq
Keywords: Tokenization, SMOTE, TF-IDF, Spam Email Detection, Model Performance Optimization, Classification Accuracy Enhancement

Abstract

The rapid advancement of information and communication technology has transformed the way people interact and exchange information. Email remains one of the most widely used digital communication tools, but it is often exploited to send spam. Spam emails can contain phishing links, malware, or unsolicited advertisements, posing significant risks to individuals and organizations. Accurate and efficient spam detection methods are therefore increasingly important. This study proposes a lightweight and efficient spam email classification approach that combines the naïve Bayes algorithm with term frequency-inverse document frequency (TF-IDF) feature extraction and the synthetic minority oversampling technique (SMOTE) to address class imbalance. A series of preprocessing steps (tokenization, lemmatization, and stopword removal), followed by TF-IDF transformation, was applied to normalize and vectorize the email text data. SMOTE was applied only to the training dataset to balance the class distribution and avoid data leakage during evaluation. Experimental results showed that the naïve Bayes model initially achieved 88% accuracy, 86% recall, 100% precision, and a 92% F1 score. After SMOTE was applied properly, the model achieved 100% accuracy, precision, recall, and F1 score, indicating perfect classification of spam and non-spam (ham) emails. These results confirm that proper class balancing significantly improves the model’s ability to detect spam emails. Overall, this study highlights the effectiveness of combining TF-IDF, naïve Bayes, and SMOTE as a robust yet computationally efficient solution for modern spam detection, particularly suited to real-time and resource-constrained environments.
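
The idea of balancing only the training split can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `smote` and `balance_training_set`, the toy two-dimensional vectors (stand-ins for TF-IDF features), the labels 0 and 1, and the choice of label 1 as the minority class are all assumptions made for illustration. The sketch uses only the Python standard library and shows the core SMOTE step, which creates a synthetic sample by interpolating between a minority sample and one of its nearest minority neighbours.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to x.
        others = [p for p in minority if p is not x]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbour = rng.choice(others[:k])
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + lam * (b - a) for a, b in zip(x, neighbour)])
    return synthetic


def balance_training_set(X_train, y_train, minority_label, seed=0):
    """Oversample the minority class until both classes have equal counts.
    Returns new lists; the inputs are left unchanged."""
    minority = [x for x, y in zip(X_train, y_train) if y == minority_label]
    n_majority = len(y_train) - len(minority)
    new_points = smote(minority, n_majority - len(minority), seed=seed)
    return X_train + new_points, y_train + [minority_label] * len(new_points)


# Toy data (hypothetical): six majority samples (label 0), two minority (label 1).
X_train = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1], [0.1, 0.2],
           [0.0, 0.3], [0.3, 0.0], [0.9, 0.8], [0.8, 0.9]]
y_train = [0, 0, 0, 0, 0, 0, 1, 1]

# The held-out split is never passed to the oversampler, so no synthetic
# sample can leak into evaluation.
X_test = [[0.1, 0.1], [0.85, 0.85]]
y_test = [0, 1]

X_bal, y_bal = balance_training_set(X_train, y_train, minority_label=1)
print(y_bal.count(0), y_bal.count(1))  # prints: 6 6
```

A classifier would then be fitted on `X_bal` and `y_bal` and scored on the untouched `X_test` and `y_test`. Oversampling before the split, by contrast, could place synthetic points derived from test samples into the training data and inflate the reported metrics.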

Published
2025-11-28
How to Cite
Andi Maslan, Azan Rahman, Umar Faruq, & Rabei Raad Ali Al-Jawr. (2025). Spam Email Classification Optimization With NLP-Based Naïve Bayes on TF-IDF and SMOTE. Jurnal Nasional Teknik Elektro Dan Teknologi Informasi, 14(4), 263-271. https://doi.org/10.22146/jnteti.v14i4.20931
Section
Articles