ABSTRACT:
Text categorization (also known as text classification) is the task of automatically assigning documents to one or more categories from a pre-specified set. The task has several applications, including spam filtering, identification of document genre, automated indexing of scientific articles against a predefined thesaurus of technical terms, and automated extraction of metadata. Text categorization matters because unstructured text is the largest readily available source of data, and organizing it manually is infeasible given the number of documents involved and the time required. The accuracy of modern text categorization systems rivals that of trained human professionals. This study experimentally compared four machine learning classifiers used in text categorization: Naïve Bayes, decision trees, k-Nearest Neighbour (kNN) and Support Vector Machines (SVM). The classifiers were developed in the Python programming language. When run on the Reuters dataset, SVM significantly outperformed Naïve Bayes, kNN and decision trees, and decision trees performed worst of the four. Observations made while running the experiments suggest a trade-off between simplicity and effectiveness. In conclusion, the results of this comparative analysis indicate that SVM is the most effective of the classifiers considered in this study.
Keywords:
classifier, Decision trees, k-Nearest Neighbour (kNN), Machine learning, Naïve Bayes, Support Vector Machines (SVM), text categorization, text classification