Machine Learning for SPAM Detection

Published: 2023-03-17

Page: 167-179


Phani Teja Nallamothu *

Strava, United States.

Mohd Shais Khan

Osmania University, Hyderabad, Telangana, India.

*Author to whom correspondence should be addressed.


Abstract

Email and messaging are used in practically every sector today, from business to education. Messages fall into two categories: ham (legitimate) and spam. Spam, also known as junk or unsolicited email, can harm users by wasting their time and computing resources and by stealing important data. The volume of spam messages is rising quickly. Email and IoT service providers therefore face major challenges in identifying and filtering spam. Among the methods developed for detecting and preventing spam, filtering is one of the most important and widely used. It has been implemented with a number of machine learning and deep learning techniques, including Naive Bayes, decision trees, neural networks, and random forests. This study surveys the machine learning methods used for spam filtering and groups them into categories. It also compares the methods on accuracy, precision, recall, and other measures.

Keywords: Spam, ham, machine learning, supervised machine learning
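
As an illustration of the evaluation measures named in the abstract, the following is a minimal sketch of a Naive Bayes spam filter scored on accuracy, precision, and recall. It is not the authors' implementation: the toy messages, the function names (train_naive_bayes, predict, evaluate), and the use of add-one smoothing are all assumptions made for this example.

```python
import math
from collections import Counter

def train_naive_bayes(messages, labels):
    """Count words per class for a multinomial Naive Bayes model."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter(labels)
    for text, label in zip(messages, labels):
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, class_counts, vocab

def predict(model, text):
    """Return the class with the highest log-posterior score."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label in ("spam", "ham"):
        n_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total)  # class prior
        for word in text.lower().split():
            if word in vocab:  # words unseen in training are ignored
                # add-one (Laplace) smoothing avoids zero probabilities
                likelihood = (word_counts[label][word] + 1) / (n_words + len(vocab))
                score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)

def evaluate(y_true, y_pred, positive="spam"):
    """Compute accuracy, precision, and recall with spam as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical toy data, invented for this sketch only.
train_texts = [
    "win free prize now",
    "free cash offer click now",
    "claim your free prize",
    "meeting agenda for monday",
    "lunch with the project team",
    "see you at the meeting",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

test_texts = [
    "free prize offer",
    "project meeting monday",
    "click now to win",
    "team lunch agenda",
    "free prize for the team",  # legitimate, but uses spam-like words
]
test_labels = ["spam", "ham", "spam", "ham", "ham"]

model = train_naive_bayes(train_texts, train_labels)
predictions = [predict(model, text) for text in test_texts]
metrics = evaluate(test_labels, predictions)
print(predictions)
print(metrics)
```

On this toy data the last test message is legitimate but is flagged as spam, so precision falls below recall. That gap is why a comparison of filters usually reports several measures rather than accuracy alone.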


How to Cite

Nallamothu, P. T., & Khan, M. S. (2023). Machine Learning for SPAM Detection. Asian Journal of Advances in Research, 6(1), 167–179. Retrieved from https://mbimph.com/index.php/AJOAIR/article/view/3417
