Data Validation for Platform Migration

Requirement

Objective: Data validation is a critical step in the data pipeline migration process, designed to ensure data consistency and accuracy between the source and target platforms. It verifies that data integrity is maintained when migrating data pipelines that use an AWS S3 file as the source and a Databricks Delta table as the target.

 

Requirement: Perform data validation between an AWS S3 source and a Databricks Delta target table. Read the data from d_product.csv stored in AWS S3 with its schema inferred, and compare it with the corresponding d_product table in Databricks, also with its schema inferred. Generate a validation report highlighting both row-level and column-level differences, including the total column count and total row count for the source and target data, based on the following criteria:

 

Column-Level Validation: Compare the total column counts of the source table and the target table. If any mismatches are found, display only the names of the mismatched columns.

Row-Level Validation: Compare the total record counts of the source table and the target table. If discrepancies exist, display the missing or unmatched prod_id values, as sketched below.

 

Primary key of both tables: prod_id
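
A minimal PySpark sketch of both checks, using prod_id as the join key. It assumes the source and target data have already been loaded into DataFrames named src_df and tgt_df (hypothetical names; a load sketch follows the access-key note further down).

```python
# Sketch of column-level and row-level validation (assumes src_df and tgt_df exist).

# --- Column-level validation: compare column counts, list mismatched column names ---
src_cols, tgt_cols = set(src_df.columns), set(tgt_df.columns)
print(f"Source column count: {len(src_cols)}, Target column count: {len(tgt_cols)}")
if src_cols != tgt_cols:
    print("Columns only in source:", sorted(src_cols - tgt_cols))
    print("Columns only in target:", sorted(tgt_cols - src_cols))

# --- Row-level validation: compare record counts, list missing/unmatched prod_id values ---
src_count, tgt_count = src_df.count(), tgt_df.count()
print(f"Source row count: {src_count}, Target row count: {tgt_count}")
if src_count != tgt_count:
    # prod_id values present in the source but absent from the target, and vice versa.
    missing_in_target = src_df.select("prod_id").join(
        tgt_df.select("prod_id"), on="prod_id", how="left_anti")
    missing_in_source = tgt_df.select("prod_id").join(
        src_df.select("prod_id"), on="prod_id", how="left_anti")
    missing_in_target.show(truncate=False)
    missing_in_source.show(truncate=False)
```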

 

Source table path (AWS S3 path): s3://agilisium-playground-dev/filestore/purgo/d_product.csv

 

Target table: purgo_playground.d_product

 

Access and secret key details: Configure Spark to access S3 by retrieving the access_key and secret_key securely from the Databricks secret scope aws_keys.
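
A minimal sketch of the access configuration and table loads, assuming a Databricks notebook where spark and dbutils are predefined, that the aws_keys scope holds secrets named access_key and secret_key as stated above, and that the cluster routes s3:// URIs through the S3A connector (the Databricks default).

```python
# Retrieve credentials from the Databricks secret scope "aws_keys".
access_key = dbutils.secrets.get(scope="aws_keys", key="access_key")
secret_key = dbutils.secrets.get(scope="aws_keys", key="secret_key")

# Pass the credentials to the S3A connector via the Hadoop configuration.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", access_key)
hadoop_conf.set("fs.s3a.secret.key", secret_key)

# Source: CSV file in S3, schema inferred.
src_df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("s3://agilisium-playground-dev/filestore/purgo/d_product.csv")
)

# Target: Delta table registered in the metastore; its schema comes from the table definition.
tgt_df = spark.table("purgo_playground.d_product")
```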

 

Expected Codebase: PySpark

 

Output: Show only the results.
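
To show only the results, a small summary of the total column and row counts could be assembled and displayed alongside the mismatch output from the validation sketch above; the report layout below is an assumption, not a prescribed format.

```python
# Hypothetical summary report: total column and row counts for source vs. target.
summary = spark.createDataFrame(
    [
        ("total_column_count", len(src_df.columns), len(tgt_df.columns)),
        ("total_row_count", src_df.count(), tgt_df.count()),
    ],
    ["metric", "source", "target"],
)
summary.show(truncate=False)
```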

Purgo AI Agentic Code
