An Empirical and Conceptual Analysis of Clustering Methods for Unlabeled Data Sets

P Krishnamoorthy; R P Kannan; A M Sarravanaprabhu; A J Rajeswari Joe

doi:10.34293/sijash.v13iS3-i1-Feb.10258

P Krishnamoorthy, Associate Professor, PG & Research, Department of Computer Science, Thiruthangal Nadar College, Chennai, Tamil Nadu, India
R P Kannan, Assistant Professor, Ramakrishna Mission Vivekananda College, Mylapore, Chennai, Tamil Nadu, India
A M Sarravanaprabhu, Assistant Professor, Department of Computer Science, Velammal Institute of Technology, Panchetti
A J Rajeswari Joe, Associate Professor, PG & Research, Department of Computer Science, Thiruthangal Nadar College, Chennai, Tamil Nadu, India

Keywords: Data Mining, Unsupervised Learning, Clustering Techniques, Comparative Analysis, Density-Based Clustering

Abstract

Digital technologies and information systems have developed rapidly, producing very large volumes of data in business analytics, healthcare, scientific research, social networks, and smart systems. Much of this data carries no class labels, which limits the applicability of supervised learning methods. Clustering has therefore become an essential unsupervised data mining technique for identifying natural structures and patterns in unlabeled data. By grouping similar data objects according to their intrinsic properties, clustering supports exploratory analysis, pattern discovery, and data summarization. This paper presents an original analysis of clustering methods that emphasizes conceptual clarity, evaluation, and applicability. Three key clustering paradigms (centroid-based, hierarchical, and density-based) are discussed from behavioral and performance perspectives. Rather than restating algorithmic descriptions, the study examines how these paradigms respond to variations in data size, density, noise, and distribution. A single representative dataset is used to illustrate the behavior of the algorithms, and several comparative tables with numerical and percentage-like measures are presented for clarity. The results indicate that clustering effectiveness depends strongly on the characteristics of the data, so the algorithm must be chosen carefully to obtain meaningful and reliable data mining results.