TY  - JOUR
TI  - Handling Duplicate Data in Big Data
AU  - Jony Kumar
AU  - Mamta Yadav
JO  - International Journal of Scientific Research in Computer Science, Engineering and Information Technology
PB  - Technoscience Academy
DA  - 2018/06/30
PY  - 2018
DO  - 10.32628/IJSRCSEIT
UR  - https://ijsrcseit.com/CSEIT1835255
VL  - 3
IS  - 5
SP  - 1163
EP  - 1167
AB  - The problem of detecting and eliminating duplicate data is one of the major problems in the broad area of data cleaning and data quality in data warehouses. The same logical real-world entity often has multiple representations in a data warehouse. Duplicate elimination is hard because duplicates arise from several types of errors, such as typographical errors and different representations of the same logical value. It is also important to detect and clean equivalence errors, because a single equivalence error may produce several duplicate tuples. Recent research efforts have focused on duplicate elimination in data warehouses, which entails matching inexact duplicate records: records that refer to the same real-world entity without being syntactically equivalent. This paper focuses on the efficient detection and elimination of duplicate data. The main objective of this research work is to detect both exact and inexact duplicates using duplicate detection and elimination rules, thereby improving the quality of the data. The importance of data accuracy and quality has increased with the explosion of data size, and it is crucial to the success of any cross-enterprise integration, business intelligence, or data mining solution.
ER  - 