Partitioned Parquet backup of Customer 360 data
Requirement
Information: Backing up data is essential for recovery when issues occur, but backups also consume storage space. The data should therefore be stored efficiently, without compromising either storage footprint or data quality.
Requirement: Develop a PySpark script that stores the customer_360_raw table data as compressed Parquet files in a Databricks volume, partitioned by state. Additionally, run a Databricks VACUUM operation on the original customer_360_raw table, retaining only the files from the last 30 days (equivalent to 720 hours). A sketch appears under Purgo AI Agentic Code below.
Volume Information:
/Volumes/agilisium_playground/purgo_playground/customer_360_raw_backup
Unity Catalog table: customer_360_raw
Expected output:
* Databricks PySpark / Spark SQL code
* Parquet files
Purgo AI Agentic Code
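Below is a minimal PySpark sketch of the requirement, not a definitive implementation. It assumes the fully qualified Unity Catalog name agilisium_playground.purgo_playground.customer_360_raw (inferred from the volume path above), that the table contains a state column to partition on, and that the source is a Delta table (VACUUM applies only to Delta tables).

# Minimal sketch, assuming a Databricks cluster with an active SparkSession.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumed fully qualified name, inferred from the volume path; adjust if the
# table lives in a different catalog or schema.
SOURCE_TABLE = "agilisium_playground.purgo_playground.customer_360_raw"
BACKUP_PATH = "/Volumes/agilisium_playground/purgo_playground/customer_360_raw_backup"

# Read the source table registered in Unity Catalog.
df = spark.read.table(SOURCE_TABLE)

# Write a snappy-compressed Parquet backup to the volume, partitioned by the
# (assumed) state column; overwrite any previous backup at the same path.
(
    df.write
    .mode("overwrite")
    .partitionBy("state")
    .option("compression", "snappy")
    .parquet(BACKUP_PATH)
)

# VACUUM the original Delta table, retaining only files from the last
# 30 days (720 hours). Since 720 hours exceeds Delta's default 7-day
# retention threshold, no retention-check override is needed.
spark.sql(f"VACUUM {SOURCE_TABLE} RETAIN 720 HOURS")

Note that the Parquet write above produces a plain file backup, so the backup itself carries no Delta transaction log; the VACUUM affects only the original Delta table, not the files written to the volume.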