top of page

Creating data quality table using pyspark

Requirement

Information:

 

In life sciences, ensuring the accuracy, consistency, and integrity of d_product_revenue data is essential for compliance, regulatory reporting, and analysis. This PySpark script is designed to create a Data Quality (DQ) Table that validates and ensures the reliability of data in the transaction table. The DQ table consists of three columns: check_no, check_name, and dq_result, and systematically records the outcomes of various data quality checks for the purgo_playground.d_product_revenue table. These checks are critical for maintaining high-quality, trustworthy data that supports informed decision-making. The checks include: 

 

The results of checks are logged in the DQ table, providing a detailed overview of the dataset's quality and helping maintain the integrity of purgo_playground.d_product_revenue data. 

 

Requirement:

 

Develop a PySpark script to drop and create a Data Quality (DQ) table for the existing table purgo_playground.d_product_revenue. This DQ table should include three columns: check_no, check_name, and dq_result. It will be used to record the outcomes of various checks, with results logged as either "Pass" or "Fail" for each check.

 

Column Name Check: 

 

Validate the column names of the purgo_playground.d_product_revenue table against the expected list:

 

['product_id', 'product_name', 'product_type', 'revenue', 'country', 'customer_id', 'purchased_date', 'invoice_date', 'invoice_number', 'is_returned', 'customer_satisfaction_score', 'product_details'].

 

If all names match, log "Pass" in the dq_result column; otherwise, log "Fail." The check_name for this validation is "Column Name Validation."

 

Negative Value Check: 

 

Ensure that the revenue column in purgo_playground.d_product_revenue table do not contain negative values. If any negative values are found, log "Fail"; otherwise, log "Pass." 

 

The check_name for this check should be "Revenue Value Check." 

 

Decimal Precision Check: 

 

Verify that amount-based column like revenue have exactly two decimal places. If any column value has more or fewer than two decimal places, log "Fail"; otherwise, log "Pass." 

 

The check_name for this check should be "Decimal Precision Check." 

 

Unity Catalog Information: purgo_playground.d_product_revenue table

 

Excepted Output: Databricks PySpark code and purgo_playground.dq_check_table

Purgo AI Agentic Code

bottom of page