This repository demonstrates a structured approach to integrating Azure Databricks with Azure Data Lake Storage Gen2 and Azure SQL Database. It outlines a data pipeline that ingests raw data, processes it through bronze and silver layers, and finally writes the refined data to an Azure SQL Database.
-
datasets/: Contains sample datasets for demonstration purposes.
-
1.mount_adls_gen2.py: Script to mount Azure Data Lake Storage Gen2 to Databricks.
-
2.create_databases.sql: SQL script to create necessary databases within Databricks.
-
3.bronze_layer_table.py: Processes and stores raw data into the bronze layer.
-
4.silver_layer_table.py: Transforms bronze data and stores it into the silver layer.
-
5.silver_to_sql_database.py: Transfers data from the silver layer to an Azure SQL Database.
- Azure Subscription
- Azure Databricks Workspace
- Azure Data Lake Storage Gen2
- Azure SQL Database
- Databricks UI access
- Required credentials and access rights for mounting and DB connections
- Update
1.mount_adls_gen2.pywith:- Storage account name
- Container name
- Client ID, Tenant ID, and Secret
- Run the script in Databricks to mount the storage.
- Run
2.create_databases.sqlin a SQL notebook to createbronze_dbandsilver_db.
- Run
3.bronze_layer_table.pyto load raw files from ADLS into a bronze table.
- Run
4.silver_layer_table.pyto apply transformations and store clean data in the silver table.
- Update
5.silver_to_sql_database.pywith your JDBC connection string, credentials, and target table info. - Run the script to push data from the silver layer to Azure SQL DB.
- Ingest data from Azure SQL DB in PowerBI and create visualisations.