I was curious how this new support performs against regular Parquet, so I adapted the notebook Databricks provided to include a test against that format and ran it on my Azure Databricks cluster (two Standard_DS3_v2 VMs, each with 14.0 GB memory, 4 cores and 0.75 DBUs) using Databricks Runtime 5.0.
The notebook with the Scala code is available here, and the results I got were:
| Test | Avro | Parquet | Parquet as % of Avro |
|---|---|---|---|
| Read time (ms) | 28061 | 18131 | 65% |
| Write time (ms) | 41342 | 33904 | 82% |
| Disk space (MB) | 2138 | 2037 | 95% |
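
The last column of the table appears to be the Parquet figure expressed as a percentage of the Avro figure, so lower means Parquet did better. As a quick sanity check, here is a minimal sketch that reproduces it from the numbers above. It is plain Python rather than the notebook's Scala, and the names `results` and `parquet_vs_avro` are just illustrative.

```python
# Figures copied from the results table, as (Avro, Parquet) pairs.
results = {
    "Read time (ms)": (28061, 18131),
    "Write time (ms)": (41342, 33904),
    "Disk space (MB)": (2138, 2037),
}


def parquet_vs_avro(avro, parquet):
    """Return the Parquet figure as a whole-number percentage of the Avro figure."""
    return round(100 * parquet / avro)


ratios = {test: parquet_vs_avro(avro, parquet) for test, (avro, parquet) in results.items()}

for test, pct in ratios.items():
    print(f"{test}: Parquet is {pct}% of Avro")
```

Read this way, Parquet took roughly a third less time to read, about a fifth less time to write, and about 5% less disk space in this particular run.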
Parquet came out ahead in all three tests. That surprised me: since Avro is row-based and Parquet is columnar, I had expected Avro to perform better given the nature of the tests. Anyway, my goal was just to satisfy my curiosity about the performance differences at a high level, not to compare the formats in general. For that, this deck is a couple of years old but has interesting information.