Recently I put together a prototype that uses Python code in Azure Databricks to clean up data and then efficiently insert it into Azure Synapse Analytics (previously known as Azure SQL Data Warehouse) tables. This post documents the relevant context and then the technical detail of how to do it. There is plenty of published documentation on this integration, but I wanted to also introduce the two main mechanisms for it (PolyBase and the COPY statement), and specifically to append rows to existing Synapse tables, instead of rewriting them, which is what most published examples do. These are the references I found most useful:
- Use virtual network service endpoints and rules for servers in Azure SQL Database (Microsoft docs)
- This page covers the case where your data lake (Azure Storage) has access restricted to a VNet, and the section "Azure Synapse PolyBase and COPY statement" also gives some relevant context: "PolyBase and the COPY statement is commonly used to load data into Azure Synapse Analytics from Azure Storage accounts for high throughput data ingestion. If the Azure Storage account that you are loading data from limits access only to a set of VNet-subnets, connectivity when using PolyBase and the COPY statement to the storage account will break". Further down, the example goes into creating external tables, which you don't actually need to create when loading from Databricks with the COPY statement.
- Azure Synapse Analytics (Databricks documentation)
- This is perhaps the most complete page in terms of explaining how this works, but also the most complex. Again it covers PolyBase and the COPY statement, and includes code, but the code provided creates a new table instead of adding rows to an existing one. If you use the provided code to write to an existing table, it will actually overwrite the table and its schema.
- Loading from Azure Data Lake Store Gen 2 into Azure Synapse Analytics (Azure SQL DW) via Azure Databricks (medium post)
- A good post, simpler to follow than the Databricks one, and it includes information on how to use OAuth 2.0 with Azure Storage instead of the storage account key. Again, though, the code overwrites existing Synapse tables rather than appending to them.
- Tutorial: Extract, transform, and load data by using Azure Databricks (Microsoft docs)
- Finally, this is a step-by-step tutorial covering the end-to-end process. It uses Scala instead of Python, and again overwrites the destination tables. I followed the steps from here, specifically the section "Load data into Azure Synapse", and then made some modifications.
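Putting the common thread of those links together, here is a minimal PySpark sketch of the append path using the Databricks Synapse connector (`com.databricks.spark.sqldw`). The server, database, table, and `tempDir` names below are placeholders for your own resources, not values from any of the linked articles; the key point is `mode("append")`, which inserts rows into an existing table instead of dropping and recreating it as `mode("overwrite")` does.

```python
# Placeholder connection details (assumptions, substitute your own).
server = "my-synapse-server"   # hypothetical logical SQL server name
database = "my-dw"             # hypothetical dedicated SQL pool
jdbc_url = (
    f"jdbc:sqlserver://{server}.database.windows.net:1433;"
    f"database={database};encrypt=true;loginTimeout=30;"
)

# Staging area in the data lake that PolyBase / the COPY statement
# reads from; must be an ABFS path the cluster can write to.
temp_dir = "abfss://staging@mydatalake.dfs.core.windows.net/synapse-tmp"

def append_to_synapse(df, table_name):
    """Append df's rows to an existing Synapse table.

    mode("append") is the difference from most published examples,
    which use mode("overwrite") and so rewrite the table and schema.
    """
    (df.write
       .format("com.databricks.spark.sqldw")
       .option("url", jdbc_url)
       # Forward the storage credentials configured on the Spark
       # session to Synapse so it can read the staged files.
       .option("forwardSparkAzureStorageCredentials", "true")
       .option("dbTable", table_name)
       .option("tempDir", temp_dir)
       .mode("append")
       .save())
```

Before calling this in a notebook you also need the storage account key on the Spark session, e.g. `spark.conf.set("fs.azure.account.key.mydatalake.dfs.core.windows.net", dbutils.secrets.get(scope, key))`, ideally pulled from a secret scope rather than hard-coded.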
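If, as the medium post suggests, you prefer OAuth 2.0 over the storage account key, the ABFS driver supports service principal credentials via Hadoop configuration keys. A sketch, where the storage account name, application id, secret, and tenant id are all placeholders you would supply from a secret scope:

```python
# Hypothetical values (assumptions, substitute your own).
storage_account = "mydatalake"
tenant_id = "00000000-0000-0000-0000-000000000000"

suffix = f"{storage_account}.dfs.core.windows.net"
configs = {
    # Switch the ABFS driver to OAuth with client credentials.
    f"fs.azure.account.auth.type.{suffix}": "OAuth",
    f"fs.azure.account.oauth.provider.type.{suffix}":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    # Service principal (Azure AD app registration) credentials.
    f"fs.azure.account.oauth2.client.id.{suffix}": "<application-id>",
    f"fs.azure.account.oauth2.client.secret.{suffix}": "<client-secret>",
    f"fs.azure.account.oauth2.client.endpoint.{suffix}":
        f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
}

# In a Databricks notebook you would then apply these with:
# for key, value in configs.items():
#     spark.conf.set(key, value)
```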