Volume 12 Number 4 (Jul. 2017)
JCP 2017 Vol.12(4): 362-370 ISSN: 1796-203X
doi: 10.17706/jcp.12.4.362-370

Efficient Cross User Client Side Data Deduplication in Hadoop

Priteshkumar Prajapati, Parth Shah, Amit Ganatra, Sandipkumar Patel
1Department of Information Technology, C.S.P.I.T., CHARUSAT, Anand, India.
2Department of Computer Engineering, C.S.P.I.T., CHARUSAT, Anand, India.


Abstract—Hadoop is widely used for applications such as Aadhaar card, healthcare, media, ad platforms, fraud detection and crime, and education. However, it does not provide an efficient, optimized data storage solution. We observed that when a user uploads the same file twice under the same file name, Hadoop does not allow the second copy to be saved, but when a user uploads the same content under a different file name, Hadoop accepts the upload. In practice, many users (cross user) upload files with identical contents under different names, which wastes storage space. We address this problem by providing data deduplication in Hadoop. Before uploading data to HDFS, we calculate the hash value of the file and store it in a database (HBase) for later use. When the same or another user tries to upload a file with the same content, even under a different name, our DeDup module calculates its hash value and checks it against HBase. If the hash value matches, the module reports that the file already exists. Experimental analysis on text, audio, video, zip, and other file types demonstrates that the proposed solution optimizes storage space while incurring very small computation overhead.
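
The check described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name the hash function, so SHA-256 is assumed here, and the names DedupIndex, file_digest, upload_with_dedup, and the store callable are hypothetical. A plain dictionary stands in for the HBase table, and the store callable stands in for the actual write to HDFS.

```python
import hashlib
import io


class DedupIndex:
    """In-memory stand-in (assumption) for the hash table kept in HBase."""

    def __init__(self):
        self._hash_to_path = {}

    def lookup(self, digest):
        # Returns the path already stored for this content, or None.
        return self._hash_to_path.get(digest)

    def register(self, digest, path):
        self._hash_to_path[digest] = path


def file_digest(stream, chunk_size=1 << 20):
    """Hash a binary stream in chunks so large files need not fit in memory."""
    hasher = hashlib.sha256()  # assumption: the abstract does not name the hash
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        hasher.update(chunk)
    return hasher.hexdigest()


def upload_with_dedup(index, stream, target_path, store):
    """Upload only if the content is new.

    `store(path, stream)` is a hypothetical callable that writes to HDFS.
    Returns (uploaded, path): `path` is where the content lives, which is
    the earlier copy's path when a duplicate is detected.
    """
    digest = file_digest(stream)
    existing = index.lookup(digest)
    if existing is not None:
        return False, existing  # duplicate content: skip the upload
    stream.seek(0)  # rewind after hashing so the full content is written
    store(target_path, stream)
    index.register(digest, target_path)
    return True, target_path


if __name__ == "__main__":
    index = DedupIndex()
    written = {}

    def fake_store(path, stream):
        written[path] = stream.read()

    # Two users upload identical content under different names.
    print(upload_with_dedup(index, io.BytesIO(b"report"), "/user/a/r1.txt", fake_store))
    print(upload_with_dedup(index, io.BytesIO(b"report"), "/user/b/r2.txt", fake_store))
```

Because the hash is computed on the client before any data is sent, a duplicate costs one hash computation and one index lookup rather than a full transfer, which is consistent with the small computation overhead the abstract reports.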

Index Terms—Cloud storage, deduplication, Hadoop, Hadoop distributed file system, Hadoop database.


Cite: Priteshkumar Prajapati, Parth Shah, Amit Ganatra, Sandipkumar Patel, "Efficient Cross User Client Side Data Deduplication in Hadoop," Journal of Computers, vol. 12, no. 4, pp. 362-370, 2017.
