Volume 13 (1), February 2023, Pages 117-127

Elvin Jafarov

PhD student in the Systems Analysis, Management and Information Processing specialization, Azerbaijan State Oil and Industry University.


This article considers the issue of cleaning big data before loading it into a storage system. The typical errors made at this stage, and the methods for eliminating them, are clarified.

The technology for building a big data storage and analysis system is reviewed, and solutions for implementing the first stages of the Data Science process (data acquisition, cleaning, and loading) are described. The results of the research make it possible to move toward implementing the subsequent steps of big data processing.

It was noted that data cleansing is an essential step in working with big data, since any analysis based on inaccurate data can lead to erroneous results.
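To illustrate the kind of cleaning meant here, the following minimal sketch (a hypothetical example using only the Python standard library; the field names `name` and `age` and the validation rules are assumptions, not taken from the article) trims stray whitespace, drops incomplete or invalid rows, and removes duplicates before any analysis:

```python
import csv
import io

def clean_rows(reader):
    """Yield cleaned, de-duplicated records from a csv.DictReader-like iterable."""
    seen = set()
    for row in reader:
        # Trim whitespace introduced during data collection.
        record = {k.strip(): (v or "").strip() for k, v in row.items()}
        # Drop rows with missing mandatory fields.
        if not record.get("name") or not record.get("age"):
            continue
        # Drop rows where 'age' is not a valid number.
        if not record["age"].isdigit():
            continue
        # De-duplicate on the full record contents.
        key = tuple(sorted(record.items()))
        if key in seen:
            continue
        seen.add(key)
        yield record

raw = "name,age\n Alice ,30\nAlice,30\nBob,\nCarol,abc\nDave,41\n"
cleaned = list(clean_rows(csv.DictReader(io.StringIO(raw))))
# cleaned -> [{'name': 'Alice', 'age': '30'}, {'name': 'Dave', 'age': '41'}]
```

In a real pipeline the same checks would be applied to each batch before it reaches the warehouse, so that inaccurate records never enter the analysis.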

It was also noted that cleaning and consolidation of data can be performed while the data is being loaded into a distributed file system.

The methods of uploading data to the storage system were tested, using a distribution from Hortonworks as the implementation platform. The simplest upload methods are the web interface of the Ambari system and the HDFS shell commands that copy files from the local file system into Hadoop HDFS.
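The command-line upload path mentioned above can be sketched as follows. This is a hypothetical wrapper (the paths `/tmp/sales.csv` and `/data/raw/sales` are invented for illustration); it relies on the standard `hdfs dfs -mkdir -p` and `hdfs dfs -put` commands and assumes a configured Hadoop client on the PATH when actually executed:

```python
import subprocess

def hdfs_put(local_path, hdfs_dir, dry_run=True):
    """Build (and optionally run) the HDFS shell commands that copy a
    local file into Hadoop HDFS, mirroring a manual upload."""
    commands = [
        # Create the target directory if it does not yet exist.
        ["hdfs", "dfs", "-mkdir", "-p", hdfs_dir],
        # Copy the local file into the HDFS directory.
        ["hdfs", "dfs", "-put", local_path, hdfs_dir],
    ]
    if not dry_run:
        # Requires a configured Hadoop client; raises on failure.
        for cmd in commands:
            subprocess.run(cmd, check=True)
    return commands

cmds = hdfs_put("/tmp/sales.csv", "/data/raw/sales")
```

With `dry_run=True` the function only returns the commands, which makes the upload step easy to inspect or log before it is executed on a live cluster.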

It is shown that the ETL process should be viewed more broadly than just importing data from sources, applying minimal transformations, and loading the result into the warehouse. Data cleaning should become a mandatory stage of the work, because the cost of storage is determined not only by the volume of data but also by the quality of the collected information.
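The broader ETL view argued for above, with cleaning as an explicit mandatory stage between extraction and transformation, can be sketched as a small pipeline. This is a toy illustration over in-memory strings, not the article's implementation; the stage names and sample values are assumptions:

```python
def extract(source):
    """Extract: pull raw records from a source (here, an in-memory list)."""
    return list(source)

def clean(records):
    """Clean: the mandatory stage. Drop empty and duplicate records."""
    seen, result = set(), []
    for r in records:
        r = r.strip()
        if r and r not in seen:
            seen.add(r)
            result.append(r)
    return result

def transform(records):
    """Transform: normalize values before loading."""
    return [r.lower() for r in records]

def load(records, warehouse):
    """Load: append the prepared records to the warehouse."""
    warehouse.extend(records)
    return warehouse

warehouse = []
load(transform(clean(extract([" Ada", "Ada", "", "GRACE "]))), warehouse)
# warehouse -> ['ada', 'grace']
```

Because cleaning runs before loading, the warehouse only ever pays storage cost for records that carry information, which is the point made above about quality determining cost.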

Keywords: Big Data, Data Cleaning, Storage System, ETL process, Loading methods.