Saturday, December 1, 2018

Spark 2.4 - Avro vs Parquet

A few days ago Databricks posted this article announcing "Apache Avro as a Built-in Data Source in Apache Spark 2.4", and comparing the performance against the previous version of the Avro format support.

I was curious how this new support performs against regular Parquet, so I adapted the notebook Databricks supplied to include a test against that format and ran it on my Azure Databricks cluster (two Standard_DS3_v2 VMs, each with 14.0 GB memory, 4 cores and 0.75 DBUs) using Databricks Runtime 5.0.
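For anyone who wants a feel for this kind of measurement without opening the notebook, here is a rough sketch. It is not the notebook's code: the object name `FormatTiming`, the helpers `timeMs` and `roundTrip`, and the path argument are all made up for illustration. It assumes a Spark 2.4 session in which the "avro" source is available (it is on Databricks Runtime 5.0; elsewhere the spark-avro module has to be on the classpath).

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

// Hypothetical helper object, not taken from the Databricks notebook.
object FormatTiming {

  // Runs `body` once and returns the elapsed wall-clock time in milliseconds.
  def timeMs(body: => Unit): Long = {
    val start = System.nanoTime()
    body
    (System.nanoTime() - start) / 1000000L
  }

  // Writes `df` to `path` in the given format ("avro" or "parquet"),
  // then reads it back, timing each step. Returns (writeMs, readMs).
  def roundTrip(spark: SparkSession, df: DataFrame,
                format: String, path: String): (Long, Long) = {
    val writeMs = timeMs {
      df.write.format(format).mode("overwrite").save(path)
    }
    val readMs = timeMs {
      // Touch every row so the whole dataset is actually read;
      // a plain count() might let Parquet skip decoding the columns.
      spark.read.format(format).load(path).foreach((_: Row) => ())
    }
    (writeMs, readMs)
  }
}
```

Calling `FormatTiming.roundTrip(spark, df, "avro", somePath)` and then the same with `"parquet"` would give one pair of timings per format. A single run like this is noisy, so repeating it a few times is probably wise.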

The notebook with the Scala code is available here, and the results I got were:

Test              Avro    Parquet   Comparison (Parquet / Avro)
Read time (ms)    28061   18131     65%
Write time (ms)   41342   33904     82%
Disk space (MB)   2138    2037      95%
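The comparison column is simply the Parquet figure as a percentage of the Avro figure, rounded to the nearest whole number. A small sketch of that arithmetic, using the numbers from the table (the names `Comparison` and `pctOfAvro` are just for illustration):

```scala
object Comparison {
  // Parquet's figure expressed as a whole-number percentage of Avro's.
  def pctOfAvro(avro: Long, parquet: Long): Long =
    math.round(100.0 * parquet / avro)

  def main(args: Array[String]): Unit = {
    // (test, Avro, Parquet) as reported in the table above
    val results = Seq(
      ("Read time (ms)", 28061L, 18131L),
      ("Write time (ms)", 41342L, 33904L),
      ("Disk space (MB)", 2138L, 2037L)
    )
    results.foreach { case (test, avro, parquet) =>
      println(s"$test: ${pctOfAvro(avro, parquet)}%")
    }
  }
}
```

So in this run Parquet took roughly two thirds of Avro's read time, while the gap in disk space was only about five percent.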

Parquet comes out ahead in all three tests. Considering that Avro is row-based and Parquet is columnar, and given the nature of the tests, I had expected Avro to be the more performant of the two. Anyway, my goal was just to satisfy my curiosity about the performance differences at a high level, not to compare the formats in general. For that, this deck is a couple of years old but has interesting information.
