Comparative Study of K-Nearest Neighbour and Random Forest Machine Learning Algorithms for Spam Email Classification

Abstract:

Email remains one of the cheapest and fastest means of communication in our society. As more people use email to send and receive messages, the volume of email spam has grown accordingly, and spam remains a prevalent problem to date. Spam usually costs spammers very little, yet it wastes resources and can pose a security threat to users. Much research has gone into finding ways to curb email spam, and machine learning-based approaches are among the most efficient means found thus far. This paper presents a comparative study of the Random Forest and K-Nearest Neighbour (KNN) machine learning algorithms for spam classification. A model was built with each of the two algorithms, and both models were trained and evaluated on the Enron Spam dataset, which consists of both ham and spam emails. The Random Forest model achieved a classification accuracy of 98.03%, with a false positive rate of 2.22% and a true positive rate (recall) of 98.30%. In comparison, the KNN model achieved a classification accuracy of 97.11%, with a false positive rate of 2.81% and a true positive rate (recall) of 97.03%.

Keywords: Email spam, comparative study, machine learning algorithms, random forest, k-nearest neighbour, spam classification.
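
The three figures reported for each model in the abstract (accuracy, false positive rate, and true positive rate) can all be derived from the counts of a binary confusion matrix with spam as the positive class. The following is a minimal sketch of that calculation, not the authors' code: the function name `spam_metrics` and the eight toy labels are hypothetical and serve only to illustrate the definitions.

```python
def spam_metrics(y_true, y_pred, positive="spam"):
    """Compute accuracy, false positive rate, and true positive rate (recall).

    Hypothetical helper for illustration; `positive` names the spam class.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # spam correctly flagged
            else:
                fp += 1  # ham wrongly flagged as spam
        else:
            if actual == positive:
                fn += 1  # spam that slipped through
            else:
                tn += 1  # ham correctly passed

    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "true_positive_rate": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy labels (made up for illustration, not taken from the Enron Spam dataset).
y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "spam"]
y_pred = ["spam", "spam", "ham", "ham", "spam", "ham", "ham", "spam"]

metrics = spam_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

In a spam filter the false positive rate deserves separate attention from accuracy, because a false positive means a legitimate (ham) email was classified as spam.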

