Creating data quality table using pyspark
Requirement
Information:
In life sciences, ensuring the accuracy, consistency, and integrity of d_product_revenue data is essential for compliance, regulatory reporting, and analysis. This PySpark script is designed to create a Data Quality (DQ) Table that validates and ensures the reliability of data in the transaction table. The DQ table consists of three columns: check_no, check_name, and dq_result, and systematically records the outcomes of various data quality checks for the purgo_playground.d_product_revenue table. These checks are critical for maintaining high-quality, trustworthy data that supports informed decision-making. The checks include:
The results of checks are logged in the DQ table, providing a detailed overview of the dataset's quality and helping maintain the integrity of purgo_playground.d_product_revenue data.
Requirement:
Develop a PySpark script to drop and create a Data Quality (DQ) table for the existing table purgo_playground.d_product_revenue. This DQ table should include three columns: check_no, check_name, and dq_result. It will be used to record the outcomes of various checks, with results logged as either "Pass" or "Fail" for each check.
Column Name Check:
Validate the column names of the purgo_playground.d_product_revenue table against the expected list:
['product_id', 'product_name', 'product_type', 'revenue', 'country', 'customer_id', 'purchased_date', 'invoice_date', 'invoice_number', 'is_returned', 'customer_satisfaction_score', 'product_details'].
If all names match, log "Pass" in the dq_result column; otherwise, log "Fail." The check_name for this validation is "Column Name Validation."
Negative Value Check:
Ensure that the revenue column in purgo_playground.d_product_revenue table do not contain negative values. If any negative values are found, log "Fail"; otherwise, log "Pass."
The check_name for this check should be "Revenue Value Check."
Decimal Precision Check:
Verify that amount-based column like revenue have exactly two decimal places. If any column value has more or fewer than two decimal places, log "Fail"; otherwise, log "Pass."
The check_name for this check should be "Decimal Precision Check."
Unity Catalog Information: purgo_playground.d_product_revenue table
Excepted Output: Databricks PySpark code and purgo_playground.dq_check_table