Comparative Analysis of Clustering Algorithms on High Dimensionality Data

Abstract:

Data mining is an emerging research area used by many evolving computing technologies because it reduces dataset complexity and provides insight into the data. It requires the ability to explore enormous, heterogeneous datasets and to extract meaningful knowledge from them through the practical application of appropriate algorithms. Clustering algorithms, one such family, are commonly categorized as partitioning, hierarchical, density-based, and grid-based. Partitioning algorithms divide the data objects into several groups known as partitions, each of which represents a cluster. Hierarchical algorithms build a hierarchy, or tree, of clusters over the data objects. Density-based algorithms form clusters in regions of high density, aggregating data objects according to a specified neighbourhood. Grid-based algorithms divide the data object space into a finite number of cells that together form a grid structure. Cluster analysis is useful for understanding the distribution of data, observing the characteristics of clusters, and selecting particular clusters for further analysis. Because clustering is so frequently used in data mining to examine data, the authors were motivated to compare clustering approaches. This work determines which of two algorithms, Expectation Maximization (EM) and Hierarchical Algorithms (HA), performs better on high-dimensional data, using cluster accuracy and evaluation time as the parameters for comparison. Cluster analysis was performed using WEKA 3.8.5. The results show that the EM method achieves better runtime and accuracy when clustering high-dimensional data, and that its performance improves as the number of clusters increases. In contrast, the running time and accuracy of the HA method barely improved as the dataset varied. It is therefore observed that the HA method falls short of the EM method in performance.

Keywords: clustering, high dimensionality data, Expectation Maximization, Hierarchical Algorithm, cluster analysis, WEKA.
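
The two comparison parameters named in the abstract, cluster accuracy and evaluation time, can be illustrated with a minimal Python sketch. It is not the study's WEKA procedure: the abstract does not state how accuracy was computed, so the majority-class mapping used here is an assumption, and the names cluster_accuracy, timed, and threshold_clusterer are hypothetical helpers introduced only for illustration.

```python
import time
from collections import Counter, defaultdict


def cluster_accuracy(true_labels, cluster_ids):
    """Fraction of points whose class equals the majority class of their cluster.

    Assumption: each cluster is mapped to its most frequent class. The paper
    does not specify its accuracy measure, so this is only one plausible choice.
    """
    members = defaultdict(list)
    for label, cid in zip(true_labels, cluster_ids):
        members[cid].append(label)
    # For every cluster, count the points that belong to its majority class.
    correct = sum(Counter(labels).most_common(1)[0][1] for labels in members.values())
    return correct / len(true_labels)


def timed(cluster_fn, data):
    """Run a clustering function and return (assignments, elapsed seconds)."""
    start = time.perf_counter()
    assignments = cluster_fn(data)
    return assignments, time.perf_counter() - start


def threshold_clusterer(data):
    """Hypothetical stand-in for a real clusterer: split 1-D points at 5.0."""
    return [0 if x < 5.0 else 1 for x in data]


points = [1.0, 2.0, 3.0, 8.0, 9.0, 4.5]
classes = ["a", "a", "a", "b", "b", "b"]

assignments, seconds = timed(threshold_clusterer, points)
accuracy = cluster_accuracy(classes, assignments)
print(f"accuracy={accuracy:.3f}, time={seconds:.6f}s")
```

In a comparison of the kind described, the stand-in clusterer would be replaced by the EM and hierarchical implementations under test, with the same two measurements recorded for each.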