File quality check
Requirement
A file quality check validates the accuracy, completeness, and integrity of a file or dataset to ensure that it meets the required standards before further processing, analysis, or reporting. It is a crucial step in data management, particularly when handling large datasets or files that come from various sources, because it minimizes errors and makes subsequent operations more reliable. Typical checks include the following:
Identify missing or empty values (e.g., blank cells in a spreadsheet, empty fields in a JSON object). These could lead to incomplete reports or errors in data processing.
Validate that the file contains the expected number of rows and columns, especially if there are known rules for the structure of the dataset.
Validate that fields expected to be numeric contain only numeric values and that each field has the expected length.
Ensure that fields that should have unique values (e.g., ID numbers, email addresses) do not contain duplicates.
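The generic checks above can be sketched in plain Python before moving to Spark. This is only an illustration under stated assumptions: the function name profile_rows, its parameters, and the shape of the returned dictionary are hypothetical and do not come from any library.

```python
import csv
import io
import re
from collections import Counter


def profile_rows(text, expected_columns, numeric_field, unique_field):
    """Return simple quality findings for CSV text (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    header = reader.fieldnames or []

    # Missing or empty values, reported as (data row number, field name).
    missing = [
        (i, field)
        for i, row in enumerate(rows, start=1)
        for field in header
        if not (row[field] or "").strip()
    ]

    # Rows whose numeric field is empty or contains anything but digits 0-9.
    non_numeric_rows = [
        i
        for i, row in enumerate(rows, start=1)
        if not re.fullmatch(r"[0-9]+", (row[numeric_field] or "").strip())
    ]

    # Values that appear more than once in a field that should be unique.
    counts = Counter(row[unique_field] for row in rows)
    duplicates = sorted(value for value, n in counts.items() if n > 1)

    return {
        "row_count": len(rows),
        "columns_match": header == list(expected_columns),
        "missing": missing,
        "non_numeric_rows": non_numeric_rows,
        "duplicates": duplicates,
    }
```

Returning findings instead of raising on the first problem keeps every issue visible in a single pass, which is usually what a quality report needs.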
Requirement: Develop a PySpark query to read the sales file from DBFS (dbfs:/FileStore/tables/sales_20240611.csv).
Perform the checks below and display all sales data along with the corresponding check result for each row.
country_cd should not be null
qty_sold should be numeric only and should not be null
product_id should not contain duplicates
Date should be in yyyy-mm-dd format