Volume 5 Number 12 (Dec. 2010)
JCP 2010 Vol.5(12): 1800-1809 ISSN: 1796-203X
doi: 10.4304/jcp.5.12.1800-1809

A Domain-Independent Data Cleaning Algorithm for Detecting Similar-Duplicates

Kazi Shah Nawaz Ripon1, Ashiqur Rahman2, and G.M. Atiqur Rahaman2
1 Department of Informatics, University of Oslo, Norway; Computer Science and Engineering Discipline, Khulna University, Bangladesh
2 Computer Science and Engineering Discipline, Khulna University, Bangladesh


Abstract—Data mining algorithms generally assume that data are clean and consistent. In practice this is not always the case, so detecting and eliminating duplicate records is an important part of data cleaning. Similar-duplicate records cause over-representation of data: if the database contains different representations of the same data, the results obtained from the data mining algorithm will be erroneous. Detecting similar-duplicate records is difficult, especially when the records are domain-independent. In this paper, we propose a novel domain-independent technique for better reconciling similar-duplicate records. We also introduce new ideas for making similar-duplicate detection algorithms faster and more efficient, and we propose a significant modification of the transitivity rule. Finally, we propose an algorithm that incorporates all these techniques for similar-duplicate detection in a domain-independent environment. We compared the proposed method with other methods, and the experimental results confirm its superiority.

Index Terms—Data cleaning, similar-duplicate, domain-independent, transitivity rule, approximate duplicate.


Cite: Kazi Shah Nawaz Ripon, Ashiqur Rahman, and G.M. Atiqur Rahaman, "A Domain-Independent Data Cleaning Algorithm for Detecting Similar-Duplicates," Journal of Computers, vol. 5, no. 12, pp. 1800-1809, 2010.
