Monday, October 5, 2020

Adding data to Azure Synapse table from Azure Databricks

Recently I put together a prototype that uses Python code in Azure Databricks to clean up data and then efficiently insert it into Azure Synapse Analytics (previously known as Azure SQL Data Warehouse) tables. This post documents the relevant context and then the technical detail of how to do it. There is plenty of published documentation on this integration, but I wanted to also introduce the two main mechanisms for doing it (PolyBase and the COPY statement), and specifically to append rows instead of rewriting Synapse tables, which is what you find in most links.

Relevant links

  • Use virtual network service endpoints and rules for servers in Azure SQL Database (Microsoft docs)
    • This page covers the case of when you have your data lake (Azure Storage) with access restricted to a VNet, and the steps in the section "Azure Synapse PolyBase and COPY statement" also give you some relevant context: "PolyBase and the COPY statement is commonly used to load data into Azure Synapse Analytics from Azure Storage accounts for high throughput data ingestion. If the Azure Storage account that you are loading data from limits access only to a set of VNet-subnets, connectivity when using PolyBase and the COPY statement to the storage account will break". Further down, the example goes into creating External tables, which is not something you actually need when you do this from Databricks with the COPY statement.
  • Azure Synapse Analytics (Databricks documentation)
    • This is perhaps the most complete page in terms of explaining how this works, but also more complex. Again it refers to PolyBase and the COPY statement, and includes code, but the code provided creates a new table instead of adding to existing tables. If you use the provided code to write to an existing table, it will actually overwrite the table and its schema.
  • Loading from Azure Data Lake Store Gen 2 into Azure Synapse Analytics (Azure SQL DW) via Azure Databricks (medium post)
    • A good post, simpler to understand than the Databricks one, and including info on how to use OAuth 2.0 with Azure Storage instead of the Storage Key. Again, the code overwrites data/rewrites existing Synapse tables.
  • Tutorial: Extract, transform, and load data by using Azure Databricks (Microsoft docs)
    • Finally, this is a step-by-step tutorial of how to do the end-to-end process. It uses Scala instead of Python, and again overwrites the destination tables. I followed the steps from here, specifically the section "Load data into Azure Synapse", and then did some modifications.

PolyBase and the COPY statement

The two fastest ways to insert data into Synapse tables are PolyBase and the newer COPY statement. PolyBase is the older technology, and COPY is its intended, simpler replacement. COPY requires fewer permissions and is simpler to use - with PolyBase you may need to use Synapse External Tables, for example, which you don't need with COPY.

The Databricks documentation page above includes an overview of the differences:

«In addition to PolyBase, the Azure Synapse connector supports the COPY statement. The COPY statement offers a more convenient way of loading data into Azure Synapse without the need to create an external table, requires fewer permissions to load data, and provides an improved performance for high-throughput data ingestion into Azure Synapse.» 

And right after it, it includes the code used to control whether to use PolyBase or COPY:

spark.conf.set("spark.databricks.sqldw.writeSemantics", "<write-semantics>")

Here you can use "polybase" or "copy" as parameters to control how to do the data writes into Synapse. If you leave the line out, as I did in my code below, the default when using Azure Storage Gen2 + Databricks Runtime > 7.0 is "copy".

Finally, I specifically wanted to do the full ETL pipeline from Databricks, but the data transfer could actually also be orchestrated from Azure Data Factory. This documentation page goes into detail on how to do that, using either of the above alternatives.

Code details

The example code I found typically overwrote any existing table instead of adding records to it, and most of the examples use PolyBase + External Data Tables. In my scenario, I have the tables already created in Synapse and the Spark data frames filled with processed data, and just need to insert the rows into the destination tables. So what are the steps?

First thing, I used the code in the tutorial link above to set up auxiliary variables, just converted from Scala to Python. This could be much more compact, but I left the original format for clarity:

Setup code
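Converted from the tutorial's Scala to Python, the setup looks roughly like the sketch below. All account, container, server and credential values are placeholders for illustration (in practice you'd pull the secrets from a Databricks secret scope rather than inlining them):

```python
# All names/keys below are placeholders -- replace with your own values.
blob_storage_account = "mystorageaccount"   # staging storage account (ADLS Gen2)
blob_container = "tempdata"                 # container used for intermediate data
blob_access_key = "<storage-access-key>"    # better: dbutils.secrets.get(...)

# Staging folder the connector uses to move data between Databricks and Synapse
temp_dir = f"wasbs://{blob_container}@{blob_storage_account}.blob.core.windows.net/tempDirs"

# Spark config key that gives the session access to the staging storage account;
# on Databricks you would then run: spark.conf.set(acc_info_key, blob_access_key)
acc_info_key = f"fs.azure.account.key.{blob_storage_account}.blob.core.windows.net"

# JDBC connection string for the Synapse dedicated SQL pool (placeholder values)
dw_server = "myserver.database.windows.net"
dw_database = "mydw"
dw_user = "myuser"
dw_password = "<password>"
sql_dw_url = (
    f"jdbc:sqlserver://{dw_server};database={dw_database};"
    f"user={dw_user};password={dw_password};encrypt=true;"
    "trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30")
```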

And then simply insert the data in my data frame into the target table in Azure Synapse:
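A minimal sketch of that write, using the Azure Synapse connector's option keys; the url, dbTable and tempDir values are placeholders to adapt to your environment:

```python
# Placeholder values -- adjust url/dbTable/tempDir to your environment.
synapse_options = {
    "url": "jdbc:sqlserver://myserver.database.windows.net;database=mydw;"
           "user=myuser;password=<password>;encrypt=true;loginTimeout=30",
    "forwardSparkAzureStorageCredentials": "true",  # forward the storage key set earlier
    "dbTable": "dbo.MyTable",                       # existing destination table
    "tempDir": "wasbs://tempdata@mystorageaccount.blob.core.windows.net/tempDirs",
}

def append_to_synapse(df):
    """Append the rows of `df` to the existing Synapse table.

    mode('append') adds rows; mode('overwrite') would drop and
    recreate the table with the data frame's schema.
    """
    (df.write
       .format("com.databricks.spark.sqldw")
       .options(**synapse_options)
       .mode("append")
       .save())
```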

In the call above, the first parameter of every option() is a fixed keyword, so you really just need to change the respective database connection string (the "url" value) and the name of the table (the "dbTable" value). Also note that the connection string is not used to send the data to Synapse directly, but to tell Synapse to go fetch the data from Azure Storage Gen2.

And that's it. When you run this you get a message from Spark saying "Waiting for Azure Synapse Analytics to load intermediate data from wasbs://..... into "my-table-in-db" using COPY", and the data shows up in the table. The code above does not require any External Tables, and by using append you add to the table, while overwrite creates a new table from scratch (including overwriting the schema with that of the Dataframe). 

One caveat to the above -- it's your responsibility to make sure the types you use in your Spark data frame match those in your Azure Synapse table definition. As an example, I had some columns as IntegerType() in my Spark data frame, which can't be inserted into TINYINT/SMALLINT columns, so the append failed.
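One way to handle this is to cast the offending columns before the write; here is a small sketch (the column names and types are hypothetical, and it assumes a pyspark DataFrame):

```python
def align_types(df, casts):
    """Cast Spark data frame columns so they match the Synapse table DDL.

    casts: mapping of column name -> Spark SQL type string,
    e.g. {"status": "smallint"} -- the names here are hypothetical.
    """
    for col_name, target_type in casts.items():
        df = df.withColumn(col_name, df[col_name].cast(target_type))
    return df

# Usage sketch before writing: df = align_types(df, {"status": "smallint"})
```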

One final note - while some parts of the documentation refer to COPY as being in preview, it actually came out of preview on September 23rd.

Thursday, June 25, 2020

Using the VisionAI DevKit to capture photos to Azure regularly


I've been playing lately with several IoT devices, from the Raspberry Pi to the Jetson Nano, and one (very simple) pet project I wanted to set up was to have one of these devices capturing photos and uploading them to an Azure Storage Account, to then build time lapses or just use them for home security monitoring.
The issue with these small devices is that they're fiddly - they need enclosures and power, have to be correctly positioned, can be fragile, etc., so I decided to use my Vision AI DevKit for this.

The camera is sturdy and has a wide-angle lens, the position is adjustable and the base is heavy/stable, it has built-in Wifi, runs Azure IoT Edge and can be powered with USB-C. It also has built-in ML capabilities (it can run neural network models on a Qualcomm Snapdragon chip pretty fast), but I don't actually want to run any ML, just upload photos regularly. It does use more power than my Raspberry Pi Zero with a camera, that's the main downside.

For my time-lapse use case, I need photos at regular intervals, while for the security one I want to make sure photos are uploaded as soon as they are taken (for which I assume both power and Wifi are on). For this reason I decided not to do any local processing, just upload the photos ASAP and process them in Azure later. I'd save some bandwidth by doing processing on the camera, but that's not really an issue.

Starting point

I started with one of the Community Projects available for the camera, the Intelligent Alarm by Microsoft colleague Marek Lani. His project is entirely on GitHub, and he has a more complex setup than what I need -- he's doing object recognition on the edge as a trigger for a photo upload, which I don't want to do. He actually has a repo on GitHub for the image capture part of his project: . The reason this is relevant is that he captures images from the camera using ffmpeg over the built-in RTSP video feed (calling it from NodeJS), instead of using the SDK's capabilities to take photos. Using the latter option can mess up local ML processing and require a reboot of the camera. So my work was simplified to adapting his code for my scenario.

Code changes

- Modify the capture code

The first thing I did was to look at Marek's app.js file. His code captures a photo whenever a message is received from IoT Hub (more precisely, from another container/module running on the camera). I commented out all of this block and replaced it with a call to the function that invokes ffmpeg to capture the photo, TakeAndUploadCaptureFromStream(). In more detail, I commented out the pipeMessage function and the whole IoT Hub block starting with Client.fromEnvironment.

The second thing was to find a way to call this code regularly. The classical solution to do this is to use cron, and that's what I did, following some hints from this thread on Stackoverflow. So here are the steps:

- Created a cronjobs file with content:

* * * * * node /app/app.js > /dev/stdout

This setup means the command is called once per minute. The redirect is actually not working as intended: I want to redirect the output to the docker log, but I get an error when I use "> /proc/1/fd/1 2>/proc/1/fd/2". Something to come back to.

- Modified the Dockerfile.arm32v7 to contain:

FROM arm32v7/node:12-alpine3.11


RUN apk add --no-cache ffmpeg

COPY package*.json ./

RUN npm install --production

COPY app.js ./

# copy crontabs for root user
COPY cronjobs /etc/crontabs/root

USER root

# start crond with log level 8 in foreground, output to stderr
CMD ["crond", "-f", "-d", "8"]

The changes were:
  • changed the base image to one with alpine (which contains cron)
  • use apk to install ffmpeg instead of apt-get
  • changed the startup command to run crond in the foreground.
And that was it. I already had a provisioned IoT Hub and the camera registered as an IoT Edge device, as well as an Azure Container Registry to host my container images, and a Storage Account to drop the photos in, so I just had to:
  1. Build the container (docker build)
  2. Tag it with the ACR URL (docker tag)
  3. Push to my ACR (docker push)
  4. Add a module to the edge device configuration (Azure Portal > IoT Hub > my IoT Edge device > Set Modules), remembering to specify the required environment variables: RTSP_IP, RTSP_PORT, RTSP_PATH, STORAGE_CONTAINER and AZURE_STORAGE_CONNECTION_STRING.
After giving a few minutes for the IoT Edge runtime to download and get the container running, my Azure Storage Account now shows the incoming photos:

And this is what is running on the device:

Which matches the configuration on Azure IoT Edge:

Next Steps

After this base setup, my next step is to trigger the execution of an Azure Function or Azure Logic App on a schedule to compare the last two images to check for deltas, or to check if there's a missing photo (indicating camera is possibly off) and then triggering an email alarm. I already have some code to do image processing on an Azure Function (GitHub repo here), which will help.

Hope this helps anyone, and thanks to Marek Lani for his work on the Intelligent Alarm sample.


Turns out I had to iron out a couple of glitches in the last few days. 

The first was this: after 6-7 hours of image capturing, the AIVisionDevKitGetStartedModule module would stop working, the container would die, and restarting it didn't change the situation. Because the capture module depends on the RTSP stream this module exposes, it would also stop. The problem, it turns out, was disk space -- something was filling up the /tmp folder with core* files. My first thought was to again use a cronjob, but cron is read-only on the device, so I went the manual way:

- created a file in folder /run/user/0, with this content:


#!/bin/sh
while true
do
    rm -f /var/volatile/tmp/core*
    sleep 5m
done
Simply delete the core* files every 5 minutes. I then did a chmod +x on the file, and ran it in the background with ./ & . This is not perfect: I'll have to run it again every time the device reboots.

The second change I made was to organize the files by folder and use the date (in format yyyymmdd-hhmmss) in their names. The changes here were in app.js and included:

- in function TakeAndUploadCaptureFromStream() use the following for the first four lines:

function TakeAndUploadCaptureFromStream() {
  var rightnow = new Date();
  var folder = rightnow.toISOString().replace('-', '').replace('-', '').replace('T', '-').replace(':', '').replace(':', '').split('.')[0]; // returns eg 20200627-112802

  // -rtsp_transport tcp parameter needed to obtain all the packets properly
  var fileName = `${folder}.jpg`;

and in uploadImageToBlob() modify this block to calculate the right folder name and use it:

if (!error) {
      var justdate = fileName.split('-')[0]; // returns eg 20200627 from 20200627-112802.jpg
      blobService.createBlockBlobFromLocalFile(storageContainer, justdate + "/" + fileName, fileName, function(error, result, response) {

With these changes, I now have a new folder per day, and the files have names like 20200627-120401.jpg, much simpler to read and understand.

Monday, June 8, 2020

Setting up and configuring the Nvidia Jetson Nano

I recently purchased a Nvidia Jetson Nano to experiment more with Edge ML and Azure IoT Edge, in particular for image/video processing, and in this post I'll describe my setup steps. There are some excellent tutorials on how to do this, but I thought I still wanted to put this up -- a) in case someone else bumps into the same issues I did, and b) if I want to repeat this setup in the future.

Packing list
Here's what I bought:
  1. NVIDIA Jetson Nano Developer Kit B01 (latest version of the Nano)
  2. GST25E05-P1J MeanWell power supply, 5V DC 4.0A with barrel connector / 5.5x2.1mm (to power the Nano, as is widely recommended)
  3. RPI-2CAM Raspberry Pi Camera Module 8MP v2.1
  4. Intel Dual Band Wireless-AC 8265 M.2 network adapter (also has Bluetooth)
  5. SD Card: Samsung microSDXC UHS-I Card / Evo Plus 64GB
I quickly realized I should have bought an IPEX MHF4 antenna -- the wireless adapter sits under the heat sink, and has poor Wifi connectivity. I'll buy this, together with a fan and perhaps an enclosure/box.

Hardware setup
  1. Install the Wifi module as per the instructions here: 
  2. Write the OS Image to the SD Card with Balena Etcher, as described here:
    • Note: I skipped the "SD Card Formatter" part, my card didn't require this
  3. Insert the SD Card in the Jetson Nano with the contacts facing up, as shown here:
After doing this base setup I connected the Jetson Nano to a monitor with HDMI, USB Keyboard and mouse, and plugged in the charger, only to find that it didn't turn on. I had to use the mini-USB to power it, which proved unstable -- the Jetson would just turn off after a few minutes. I'm not sure why this was, as I wasn't doing any compute, but I speculate it may be related to the large resolution of the monitor.

I eventually found out that there's a jumper you have to set that configures the device to either use the microUSB or the external charger for power. This is jumper J48, which on the B01 sits behind the HDMI output and the place where you plug the external charger. 

As it turns out, there's a lot of information on this configuration, on how power for the Jetson Nano works, and on how to do this. For example:
On the B01 the plastic jumper piece is provided, but it's just sitting on one pin; I had to take it out and place it over both pins.

After doing this I was able to keep the Jetson Nano stably turned on, configured it using the start-up user interface, and successfully tested SSH. 

Before continuing, and while connected via SSH, I installed Jetson Stats, following the instructions here: . You'll be able to see interesting stats on the use of the chip's resources (RAM, CPU, etc.) by running sudo jtop.

Configuring remote connection
I mostly want to use the Jetson Nano headless, i.e., without having the keyboard/mouse/monitor connected to it. Getting this to work took me down a path of some pain. 

The first suggestion I tried was to use VNC with Vino (the GNOME VNC server). There is a text document on the Jetson Nano on how to do this, but those instructions don't work. In the meantime I also found out that VNC requires you to have a running UI session on the device, which you then connect to. This means that every time I shut down the device, I would have to plug in the HDMI, keyboard, etc., log in and then disconnect. Scratch VNC.

After VNC I went the RDP route and installed xrdp. Considering I mostly use Windows 10, Remote Desktop is the perfect option. I followed the steps described here, in the section "Enabling Remote Desktop". By the way, that article is one of the best guides I've found on how to set up the Jetson Nano, if you want clear step-by-step instructions.

When trying RDP, however, the session would start, show the Nvidia logo, and close again. It turns out (according to this) that there's an issue with Gnome, the desktop window manager, and one of the libraries it uses. This led me to just install xfce4, as suggested in the previous link. That solved the issue and I could now connect using remote desktop.

The story wasn't over however. When connected, I noticed the terminal console wasn't opening, and apparently this is because it was trying to use the Gnome terminal. This link gave me the solution, I had to install the xfce4-terminal and configure it: .

So to summarize:
  1. Install xrdp from here:
  2. Install xfce4 and configure it as the default window manager, from here:
  3. Install xfce4-terminal and set it as the default terminal, from here:
Done? Not yet.

Wifi Connection Drops / Wifi Power Save
While doing the previous steps, I noticed my connection to the Jetson was very unstable. SSH sessions kept dropping and pings would fail. Initially I thought this was due to the lack of antennas on my Wifi adapter, but it turns out the adapter was being put to sleep by the OS.

The first solution was to do what is suggested here: change a setting in the file /etc/NetworkManager/conf.d/default-wifi-powersave-on.conf. This doesn't work.

The second solution, from here, was to run "sudo iw dev wlan0 set power_save off". This actually works, but it's not persistent across reboots.

The third and final solution was to follow the steps here: . These instructions are for the Nvidia Jetson TX2 but also work with the Jetson Nano. Note that the instructions assume gedit is used to create files, but I used vim instead -- so replace as appropriate with whatever text editor you have installed.

I don't know if this issue is specific to the Wifi adapter I'm using, and if it also happens with USB Wifi adapters, but I found quite a few people asking for help with this and several of the previous challenges I had. Have to admit this wasn't the best user experience I've ever had.

So what's next? Setting up Azure IoT Edge and trying it with DeepStream: . I'll write about how it goes.

Tuesday, January 7, 2020

Podcast: João Pedro Martins -- AI and Machine Learning in Organizations

Note: refers to a recording done in Portuguese.

Earlier in November I recorded a podcast for (available here), which focused on AI/ML and its impact on society and organizations.

In the last few years, AI/ML has undergone a transformation, with increasing hype but also increasing problems and doubts being raised: algorithmic discrimination/bias, problems with "bad data", challenges in taking ML solutions live, incoming regulation, discussions on neural network/connectionist versus/with symbolic approaches (or "neuro-symbolic"), issues raised by face recognition, autonomous robots/warfare, etc. The field is very much in flux, but it's clear to me there's some backlash, and more caution and realism are coming to the field.

The good part of what seems to be happening is that interesting applications are making their way into our lives and organizations, in many areas from industry to science to medicine, and sometimes in "minor" features in things like development tools (VS Code/Studio comes to mind) or Office. Maybe full level-5 autonomous driving is many years away, but the investment will help it progress and find new solutions. Meanwhile, the research field (just see the latest NeurIPS) seems to be diverging from industry -- the former is now tackling other types of problems that are very technical (not only in a CompSci sense, but also mathematical/statistical) and in many cases not related to deep neural networks.

Anyway, all this to say that, when I reviewed the recording, I felt the more pessimistic side of what's happening in the industry took more time than the possibilities and real impact the tech is having -- which was annoying, and doesn't actually reflect my thoughts.

If you understand Portuguese, give it a listen and let me know what you think.

To close off, a word of thanks to the team for inviting me and doing great work with the podcast!