I was curious how this new support performs against regular Parquet, so I adapted the notebook Databricks provided to include a test against this format and spun up an Azure Databricks cluster (two Standard_DS3_v2 VMs, each with 14.0 GB memory, 4 cores and 0.75 DBUs) running Databricks Runtime 5.0.
The notebook with the Scala code is available here, and the results I got were:
| Test | Avro | Parquet | Parquet as % of Avro |
|---|---|---|---|
| Read time (ms) | 28061 | 18131 | 65% |
| Write time (ms) | 41342 | 33904 | 82% |
| Disk space (MB) | 2138 | 2037 | 95% |
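The last column is simply the second figure as a percentage of the first. A quick sketch to reproduce it, assuming the first figure in each row is Avro and the second is Parquet (which matches Parquet coming out ahead in every test); the `Result` case class and its field names are made up for illustration:

```scala
// Hypothetical container for one row of the results table.
final case class Result(test: String, avro: Double, parquet: Double) {
  // Parquet figure as a percentage of the Avro figure, rounded to a whole number.
  def parquetPctOfAvro: Long = math.round(parquet / avro * 100)
}

// Figures copied from the results above (assumed order: Avro, then Parquet).
val results = Seq(
  Result("Read time (ms)", 28061, 18131),
  Result("Write time (ms)", 41342, 33904),
  Result("Disk space (MB)", 2138, 2037)
)

results.foreach { r =>
  println(f"${r.test}%-16s ${r.parquetPctOfAvro}%d%%")
}
```

This gives 65%, 82% and 95%, so reads show the biggest gap and disk usage the smallest.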
Parquet came out ahead in all three tests. Given that Avro is row-based and Parquet is columnar, and given the nature of the tests, I had expected Avro to perform better. Anyway, my goal was just to satisfy my curiosity about the performance difference at a high level, not to compare the formats in general. For that, this deck is a couple of years old but still has interesting information.
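If you want to try a similar comparison yourself, something along these lines should be a reasonable starting point. To be clear, this is a minimal sketch and not the code from the notebook: the synthetic DataFrame, the `/tmp/format-test` path and the `timeMs` helper are all invented for illustration. It also assumes a runtime where the `avro` data source is built in; on plain Apache Spark 2.4+ you would need to add the external spark-avro module.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper: runs `block` once and returns elapsed wall-clock time in ms.
def timeMs[A](block: => A): Long = {
  val start = System.nanoTime()
  block
  (System.nanoTime() - start) / 1000000L
}

// In a Databricks notebook this returns the session that already exists.
val spark = SparkSession.builder().getOrCreate()

// Synthetic test data (an assumption, not the data set used in the post).
val df: DataFrame = spark
  .range(0L, 1000000L)
  .selectExpr("id", "cast(id as string) as label", "rand() as value")

for (format <- Seq("avro", "parquet")) {
  val path = s"/tmp/format-test/$format" // hypothetical output location

  val writeMs = timeMs {
    df.write.mode("overwrite").format(format).save(path)
  }

  // Going through .rdd forces every row to be materialised,
  // so the count cannot be answered from file metadata alone.
  val readMs = timeMs {
    spark.read.format(format).load(path).rdd.count()
  }

  println(s"$format: write ${writeMs} ms, read ${readMs} ms")
}
```

Single runs like this are noisy, so it is worth repeating each measurement a few times before reading much into the numbers.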