
Partitioned parquet backup of customer 360 data

Requirement

Information: Data backup is essential for recovery when issues occur, but it also consumes storage space. The data should therefore be saved efficiently, without compromising storage or data quality.

Requirement: Develop a PySpark script that stores the customer_360_raw table data as compressed parquet files in a Databricks volume, partitioned by state. Additionally, perform a Databricks VACUUM operation on the original customer_360_raw table, retaining only the files from the last 30 days (equivalent to 720 hours).

Volume Information:

/Volumes/agilisium_playground/purgo_playground/customer_360_raw_backup

Unity Catalog table: customer_360_raw

Expected output:

* Databricks PySpark / Spark SQL code

* Parquet files
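
Because the write is partitioned by state, the backup volume will hold one subdirectory per distinct state value, each containing compressed part files (for example, a hypothetical state=CA/ folder with snappy-compressed .parquet parts).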

Purgo AI Agentic Code
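
As a reference point, a minimal PySpark sketch of the required flow might look like the following. It assumes customer_360_raw is a Delta table registered as agilisium_playground.purgo_playground.customer_360_raw (the catalog and schema names are inferred from the volume path) and that it has a state column:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumed fully qualified table name (catalog/schema inferred from the volume path).
source_table = "agilisium_playground.purgo_playground.customer_360_raw"
backup_path = "/Volumes/agilisium_playground/purgo_playground/customer_360_raw_backup"

# Read the source table and write it to the volume as snappy-compressed
# parquet, one partition directory per state value.
df = spark.read.table(source_table)
(
    df.write
      .mode("overwrite")
      .partitionBy("state")
      .option("compression", "snappy")
      .parquet(backup_path)
)

# VACUUM the original Delta table, retaining only the files needed for the
# last 30 days of history (720 hours).
spark.sql(f"VACUUM {source_table} RETAIN 720 HOURS")

Snappy is Spark's default parquet codec; gzip can be substituted through the same compression option for a smaller footprint at higher CPU cost. The 720-hour retention is above Delta's default 7-day safety threshold, so the VACUUM runs without relaxing spark.databricks.delta.retentionDurationCheck.enabled.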
