This project shows the results of benchmarking Parquet’s Variant type through Apache Iceberg and Spark, and in the Parquet library alone.
The benchmarks are implemented in two PRs:
| Project | PR | Title | Status |
|---|---|---|---|
| Iceberg | 15629 | Core, Spark: Add JMH benchmarks for Variants | Open |
| Parquet | 3452 | GH-3451. Add a JMH benchmark for variants | Merged |
As of May 2026 I have other variant-related PRs open: for Parquet, hardening reading; for Iceberg, improving those benchmark numbers. The latest results do show improvements.
Yes, but
At the time the initial document was written (2026-04-10), the benchmark results implied that, in Iceberg + Spark queries, filtering variant data stored in Avro was faster than filtering the same data stored in Parquet, and that shredded variants performed worst.
The May 21, 2026 findings show that with row group filtering, and with a dataset and test queries tuned to exclude entire row groups, the numbers are better. There’s still a lot to be done.
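To illustrate why the dataset and queries have to be tuned before row group filtering pays off, here is a minimal sketch of min/max pruning. It is not the Iceberg or Parquet implementation: `RowGroupStats`, `may_contain` and `row_groups_to_read` are hypothetical names, and the statistics are made-up values chosen only to show the effect of data layout.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Min/max statistics for one column in one row group (hypothetical layout)."""
    min_value: int
    max_value: int
    row_count: int


def may_contain(stats: RowGroupStats, target: int) -> bool:
    """An equality predicate can only match if target lies within [min, max]."""
    return stats.min_value <= target <= stats.max_value


def row_groups_to_read(groups, target):
    """Return the indices of the row groups that cannot be excluded."""
    return [i for i, g in enumerate(groups) if may_contain(g, target)]


# Values clustered by row group: ranges do not overlap, so most groups are skipped.
clustered = [
    RowGroupStats(0, 99, 100),
    RowGroupStats(100, 199, 100),
    RowGroupStats(200, 299, 100),
]

# Same value range scattered across all groups: every group spans the full
# range, so the statistics cannot exclude anything.
scattered = [RowGroupStats(0, 299, 100) for _ in range(3)]

print(row_groups_to_read(clustered, 150))
print(row_groups_to_read(scattered, 150))
```

With the clustered layout only one row group needs to be read; with the scattered layout all three do, which is why a benchmark dataset that is not sorted or clustered on the filtered field will show little benefit from pushdown.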
Benchmarking Parquet Variants through Iceberg
| Benchmark Set | Results | Date |
|---|---|---|
| Iceberg Variant Benchmarks | Spark SQL Queries on Iceberg tables | 2026-04-09 |
| Iceberg + Predicate Pushdown | Iceberg benchmark with predicate pushdown | 2026-05-21 |
| Parquet | Parquet Variant Benchmarks | 2026-04-10 |
| Parquet Performance Graphs | Parquet Performance Graphs | 2026-04-23 |
| Experiments | Before/after comparison of changes | |
This site is automatically regenerated when data is published to its GitHub repository benchmarking-variants.
The content is all published via GitHub Pages at steveloughran.github.io/benchmarking-variants/
The hardened branch cryptographically pins the chart.js version and adds a run-secure.sh shell script, which runs the report generator
in a macOS sandbox with restricted file and network access.
Use this branch to compare JMH result JSON files. The unforked version just pulls down the latest version of a charting JS file from npm, which is something nobody should ever do.
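For a quick comparison without any charting at all, a few lines of Python are enough. The sketch below is not the report generator described above; it only assumes the standard JMH JSON output (a list of results, each with `benchmark`, an optional `params` map, and a `primaryMetric` containing `score` and `scoreUnit`). The function names are hypothetical.

```python
import json
import sys


def index_scores(results):
    """Map (benchmark name, sorted params) -> (score, unit) for a JMH result list."""
    scores = {}
    for r in results:
        # "params" is absent when a benchmark has no @Param fields.
        params = tuple(sorted(r.get("params", {}).items()))
        metric = r["primaryMetric"]
        scores[(r["benchmark"], params)] = (metric["score"], metric["scoreUnit"])
    return scores


def load_scores(path):
    """Load a JMH JSON result file and index it."""
    with open(path, encoding="utf-8") as f:
        return index_scores(json.load(f))


def compare(before, after):
    """Return (benchmark, params, before, after, percent change, unit) for shared keys."""
    rows = []
    for key in sorted(before.keys() & after.keys()):
        b, unit = before[key]
        a, _ = after[key]
        change = (a - b) / b * 100 if b else float("nan")
        rows.append((key[0], key[1], b, a, change, unit))
    return rows


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python compare_jmh.py before.json after.json
    old, new = load_scores(sys.argv[1]), load_scores(sys.argv[2])
    for name, params, b, a, change, unit in compare(old, new):
        print(f"{name} {dict(params)}: {b:.3f} -> {a:.3f} {unit} ({change:+.1f}%)")
```

Whether a negative change is an improvement depends on the benchmark mode: for average-time results lower is better, for throughput results higher is better, so read the unit before drawing conclusions.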