Review and Comparison between Clustering Algorithms with Duplicate Entities Detection Purpose

Maryam Bakhshi, Mohammad-Reza Feizi-Derakhshi, Elnaz Zafarani

Abstract


Identifying duplicate records is one of the challenging problems in the field of databases. Finding appropriate algorithms for this task therefore helps significantly in organizing information and in extracting correct answers to database queries. One of the steps of duplicate detection is clustering. Clustering partitions an existing data set into clusters so that the similarity among data within each cluster is maximal and the similarity among data in different clusters is minimal. The aim of this paper is to find appropriate clustering algorithms for duplicate detection on an existing data set. Four algorithms, K-Means, Single-Linkage, DBSCAN and Self-Organizing Maps (SOM), have been implemented and compared. The F1 measure was used to evaluate the precision and quality of the clustering; according to the obtained results, the SOM algorithm achieved a high F1 measure. A comparison between two methods, mapping to two-dimensional space and statistical averaging, was also performed; according to the results, the mapping method is better than the averaging method.
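The abstract does not say how the F1 measure was computed over clusters. As an illustration only, the sketch below assumes a pairwise definition, which is common in duplicate detection: two records count as a predicted duplicate pair when they fall in the same cluster, and as a true duplicate pair when they refer to the same entity. The function name `pairwise_f1` and the toy labels are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def pairwise_f1(true_labels, cluster_labels):
    """Return (precision, recall, F1) over all record pairs.

    Assumption: a pairwise definition of F1, not necessarily the one
    used by the authors.
    """
    if len(true_labels) != len(cluster_labels):
        raise ValueError("label sequences must have the same length")

    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_entity = true_labels[i] == true_labels[j]
        same_cluster = cluster_labels[i] == cluster_labels[j]
        if same_cluster and same_entity:
            tp += 1  # duplicate pair correctly grouped together
        elif same_cluster:
            fp += 1  # distinct entities wrongly grouped together
        elif same_entity:
            fn += 1  # duplicate pair split across clusters

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: five records, three real entities, three predicted clusters.
true_entities = ["a", "a", "b", "b", "c"]
predicted_clusters = [0, 0, 0, 1, 2]
precision, recall, f1 = pairwise_f1(true_entities, predicted_clusters)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Under this assumption, the cluster labels produced by any of the four algorithms could be scored against the same ground truth, which puts the comparison on a common footing.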




Copyright © ExcelingTech Publisher, UK