We explore the performance benefits of the Amazon EMR runtime for Apache Spark and Apache Iceberg compared to running identical workloads with open-source Spark 3.5.1 on Iceberg tables. Iceberg is a widely used open-source table format for large-scale analytical data, offering high performance and efficient management of massive datasets. Our results show that Amazon EMR can accelerate TPC-DS 3 TB workloads by a factor of 2.7, reducing the runtime from about 1.55 hours to 0.56 hours. Cost effectiveness also improves by a factor of 2.2, with the total cost dropping from $16.09 to $7.23 when using Amazon EC2 On-Demand r5d.4xlarge instances, yielding substantial savings for data processing tasks.
The solution provides a high-performance runtime environment while maintaining 100% API compatibility with open-source Apache Spark and the Iceberg table format. We previously outlined several optimizations, resulting in a four-fold speedup and a 2.8-times improvement in price-performance compared to open-source Spark 3.5.1 on the 3 TB TPC-DS benchmark. However, many of those optimizations focus on DataSource V1, while Iceberg uses Spark DataSource V2. We have therefore focused on porting select existing optimizations from the EMR runtime for Spark to DataSource V2, while also introducing improvements tailored specifically to Iceberg. The enhancements build upon the Spark runtime's advancements in query planning, physical plan operator enhancements, and optimizations leveraging Amazon S3 and the Java runtime. As part of our ongoing performance work, we have incorporated eight new incremental optimizations starting with the release of Amazon EMR 6.15 in 2023; these improvements are enabled by default in Amazon EMR 7.1. The improvements include the following:
- Optimizing DataSource V2 in Spark:
  - Dynamic filtering on non-partitioned columns
  - Removing redundant broadcast hash joins
  - Partial hash aggregate pushdown
  - Bloom filter-based joins
- Iceberg-specific enhancements:
  - Data prefetch
  - Support for file size-based estimations
All four offerings, , , , and , use the optimized runtimes. Refer to them for further details.
Benchmark results for Amazon EMR 7.1 compared with open-source Spark 3.5.1 and Iceberg 1.5.2 demonstrate significant improvements in performance and scalability. Apache Spark 3.5.1 and Apache Iceberg 1.5.2 are a powerful combination for data engineering and analytics workloads.
We conducted a series of benchmark tests to measure the Spark engine's performance on the Iceberg table format. Our results, based on the TPC-DS dataset, may differ slightly from official benchmarks due to differing setup configurations. The benchmarks evaluate the Amazon EMR runtime configuration with Spark 3.5.0 and Iceberg 1.4.3-amzn-0 on EMR 7.1 clusters against open-source Spark 3.5.1 and Iceberg 1.5.2 deployed on EC2 clusters reserved for the open-source runs.
Refer to the linked material for the setup directions and technical details. To avoid the influence of external catalogs like Hive, we used Hadoop's catalog to manage our Iceberg tables, with the Amazon S3 file system storing the catalog and warehouse. This setup is configured through the spark.sql.catalog..type property. The fact tables used the default partitioning strategy based on the date column, with partition counts ranging from about 200 to 2,100 over a span of approximately 1,100 days. The benchmark does not rely on precomputed statistics.
We executed 104 SparkSQL queries across three consecutive rounds, tracking the average runtime of each query to enable meaningful comparisons. Amazon EMR 7.1 with Iceberg achieved an average runtime of 0.56 hours across the three rounds, a notable 2.7-fold speedup over open-source Spark 3.5.1 and Iceberg 1.5.2.
The following table summarizes the metrics.

| Metric | Amazon EMR 7.1 | Open-source Spark 3.5.1 and Iceberg 1.5.2 |
| --- | --- | --- |
| Average runtime in seconds | 2033.17 | 5575.19 |
| Geometric mean over queries in seconds | 10.13153 | 20.34651 |
| Cost* | $7.23 | $16.09 |

*Detailed cost estimates are provided later in this article.*
This chart illustrates the significant per-query performance improvements achieved by Amazon EMR 7.1 in comparison to open-source Spark 3.5.1 and Iceberg 1.5.2. The performance boost varies significantly across queries, ranging from a notable 9.6-fold speedup for query 93 to a relatively modest 1.04-fold increase for query 34; notably, Amazon EMR consistently outperforms open-source Spark when using Iceberg tables.
This plot presents the 3 TB TPC-DS benchmark queries ordered by their performance improvement on Amazon EMR, with the horizontal axis listing the queries in descending order of improvement and the vertical axis showing the magnitude of the speedup.
Cost comparison
Our benchmark provides total runtime and geometric mean metrics for evaluating the performance of Spark and Iceberg in a realistic, large-scale data processing scenario. To add a financial perspective, we also derive cost estimates through a formula-driven approach that accounts for EC2 On-Demand instances, Amazon Elastic Block Store (Amazon EBS), and Amazon EMR charges.
- Amazon EC2 cost (includes SSD cost) = number of instances × r5d.4xlarge hourly rate × job runtime in hours
  - The r5d.4xlarge hourly rate is $1.152 per hour.
- Root Amazon EBS cost = number of instances × Amazon EBS rate per GB per hour × root EBS volume size × job runtime in hours
- Amazon EMR cost = number of instances × r5d.4xlarge Amazon EMR rate × job runtime in hours
  - The r5d.4xlarge Amazon EMR rate is $0.27 per hour.
- Total cost = Amazon EC2 cost + root Amazon EBS cost + Amazon EMR cost
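To make the formulas concrete, the following Python sketch evaluates them with the benchmark's figures. The EC2 and EMR rates come from the article; the EBS rate of roughly $0.10 per GB-month (about 730 hours) for a gp2 root volume is an assumption for illustration:

```python
# Cost model from the bullets above. EC2 and EMR rates are from the article;
# the EBS per-GB-month rate (~$0.10, gp2) is an assumption.
EC2_RATE = 1.152        # r5d.4xlarge On-Demand, $ per instance-hour
EMR_RATE = 0.27         # Amazon EMR charge for r5d.4xlarge, $ per instance-hour
EBS_RATE = 0.10 / 730   # approx. $ per GB-hour (730 hours in a month)

def benchmark_cost(instances, runtime_hours, root_ebs_gb, emr=True):
    """Return (ec2, ebs, emr, total) dollar costs, rounded to cents."""
    ec2 = instances * EC2_RATE * runtime_hours
    ebs = instances * root_ebs_gb * EBS_RATE * runtime_hours
    emr_fee = instances * EMR_RATE * runtime_hours if emr else 0.0
    parts = [round(x, 2) for x in (ec2, ebs, emr_fee)]
    return (*parts, round(sum(parts), 2))

print(benchmark_cost(9, 0.564, 20))             # (5.85, 0.01, 1.37, 7.23)
print(benchmark_cost(9, 1.548, 20, emr=False))  # (16.05, 0.04, 0.0, 16.09)
```

The two calls reproduce the line items reported for the Amazon EMR 7.1 run and the open-source run, respectively.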
The analysis indicates that Amazon EMR 7.1 is 2.2 times more cost-effective than open-source Spark 3.5.1 and Iceberg 1.5.2 on the same benchmark job, showcasing significant cost savings.
| Metric | Amazon EMR 7.1 | Open-source Spark 3.5.1 and Iceberg 1.5.2 |
| --- | --- | --- |
| Runtime in hours | 0.564 | 1.548 |
| Number of EC2 instances | 9 | 9 |
| Amazon EBS size | 20 GB | 20 GB |
| Amazon EC2 cost | $5.85 | $16.05 |
| Amazon EBS cost | $0.01 | $0.04 |
| Amazon EMR cost | $1.37 | $0 |
| Total cost | $7.23 | $16.09 |
| Cost savings | 2.2 times better | Baseline |
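As a quick sanity check, the headline 2.7-times speedup and 2.2-times cost-effectiveness figures follow directly from the reported runtimes and totals:

```python
# Derive the headline factors from the reported totals.
emr_runtime_s, oss_runtime_s = 2033.17, 5575.19   # average runtimes, seconds
emr_cost, oss_cost = 7.23, 16.09                  # total costs, dollars

speedup = oss_runtime_s / emr_runtime_s
price_perf = oss_cost / emr_cost
print(f"speedup: {speedup:.1f}x, cost-effectiveness: {price_perf:.1f}x")
# prints: speedup: 2.7x, cost-effectiveness: 2.2x
```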
Amazon EMR 7.1 also demonstrated significant improvements in data scanning efficiency, with Spark event logs indicating a 67% reduction in data scanned from Amazon S3 and a 44% overall decrease compared to the open-source version on the TPC-DS 3 TB benchmark. This reduction in Amazon S3 data scanning directly lowers costs for Amazon EMR workloads.
We run benchmarks for open-source Apache Spark against Iceberg tables to evaluate query performance, including read and write speeds and data ingestion rates. The results provide insight into the scalability and reliability of data warehousing solutions built on Iceberg and Spark.
We used separate EC2 clusters, each with nine r5d.4xlarge instances, to test open-source Spark 3.5.1 with Iceberg 1.5.2 and Amazon EMR 7.1 independently. The primary node was configured with 16 vCPUs and 128 GB of memory, while the eight worker nodes together provided 128 vCPUs and 1 TB of memory. To reflect a typical user setup, we used Amazon EMR's default settings and made minimal adjustments to the Spark and Iceberg configurations for a fair comparison.
The following table summarizes the Amazon EC2 configuration for the primary node and eight worker nodes, all of the r5d.4xlarge instance type.

| vCPU | Memory (GiB) | Instance storage | Root EBS volume |
| --- | --- | --- | --- |
| 16 | 128 | 2 x 300 NVMe SSD | 20 GB |
Prerequisites
To run the benchmark, complete the following prerequisites:
- Make the TPC-DS source data available in your Amazon S3 bucket (or on your local machine) so it is easy to access and retrieve.
- Copy the benchmark application to your Amazon S3 bucket using the AWS CLI, for example with aws s3 cp or aws s3 sync.
- Create the Iceberg tables using the Hadoop catalog. The following script, executed on an Amazon EMR 7.1 cluster, creates the tables:

CREATE TABLE customer_address (
  c_cdemo_sk integer,
  c_address_line1 varchar(1024),
  c_city varchar(20),
  c_region varchar(20),
  c_postal_code char(10),
  c_county varchar(50),
  c_nation_key integer,
  c_state_province varchar(40),
  c_country varchar(3));

CREATE TABLE customer_demographics (
  c_customer_sk integer,
  c_current_cust_age integer,
  c_marital_status_group char(1),
  c_income_group integer,
  c_education_level char(1),
  c_wed_marr_status char(1),
  c_race_type char(1));

CREATE TABLE nation (
  n_nation_key integer,
  n_region_key integer,
  n_name varchar(50),
  n_comment string);

CREATE TABLE region (
  r_region_key integer,
  r_name varchar(20));
Specify the Hadoop catalog warehouse location as noted previously. We use identical table schemas to run the benchmark on both Amazon EMR 7.1 and open-source Spark and Iceberg.
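The Hadoop catalog setup described above comes down to a handful of Spark properties. The following is a sketch using a hypothetical catalog name my_catalog and a placeholder warehouse path; substitute your own values:

```
spark.sql.extensions                    org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.my_catalog            org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.my_catalog.type       hadoop
spark.sql.catalog.my_catalog.warehouse  s3://<your-bucket>/<warehouse-path>
```

With the type set to hadoop, Iceberg keeps the catalog metadata directly under the warehouse path on Amazon S3, so no external catalog such as Hive is involved.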
The benchmark application is built from a dedicated branch. To build a new benchmark application, switch to the designated folder after obtaining the source code from the official GitHub repository.
To create and configure a YARN cluster on Amazon EC2: launch the instances in the same Availability Zone, install Hadoop on each instance, format HDFS by running `hdfs namenode -format`, start the HDFS and YARN services with `start-dfs.sh` and `start-yarn.sh`, and verify the cluster's status using the `yarn node -list` and `hdfs dfsadmin -report` commands.
To keep the comparison between Amazon EMR on Amazon EC2 and open-source Spark on Amazon EC2 fair, follow the instructions to set up an open-source Spark cluster on Amazon EC2 using Flintrock with eight worker nodes.
The following settings are used, based on the cluster selected for this purpose.
Run the TPC-DS benchmark with Apache Spark 3.5.1 and Iceberg 1.5.2
Complete the following steps to run the TPC-DS benchmark:
- Log in to the open-source cluster's primary node:
flintrock login $CLUSTER_NAME
- Submit your Spark job:
- Use the correct Iceberg catalog warehouse location and database for your created Iceberg tables, matching the configuration used when the tables were created.
- The results are created in
s3:///benchmark_run
. - You can track progress in
/media/ephemeral0/spark_run.log
.
Summarize the results
After the Spark job completes, retrieve the results file from the output S3 bucket at s3:///benchmark_run/timestamp=xxxx/summary.csv/xxx.csv
. You can do this either from the Amazon S3 console by locating the target bucket, or with the AWS Command Line Interface (CLI). The Spark benchmark application creates a timestamped directory and places a summary file within a folder labeled summary.csv
. The output CSV file contains four columns without headers:
- Query name
- Median time
- Minimum time
- Maximum time
Based on data from three separate runs with one iteration each, we compute both the arithmetic mean and the geometric mean of the benchmark runtimes.
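As a quick illustration, the report metrics can be computed from such a summary file with a short script. This is a minimal sketch, assuming the four-column layout above (query name, median, minimum, maximum) and aggregating the median times:

```python
# Sketch: aggregate per-query results from a headerless summary CSV
# (query name, median, min, max) into a total and a geometric mean.
import csv
import io
import math

def summarize(csv_text):
    """Return (total_of_medians, geometric_mean_of_medians) in seconds."""
    rows = [r for r in csv.reader(io.StringIO(csv_text)) if r]
    medians = [float(r[1]) for r in rows]
    total = sum(medians)
    geomean = math.exp(sum(math.log(m) for m in medians) / len(medians))
    return total, geomean

# Tiny synthetic example: geometric mean of 2 and 8 is 4.
sample = "q1,2.0,1.5,2.5\nq2,8.0,7.0,9.0\n"
total, geomean = summarize(sample)
print(f"{total:.1f} {geomean:.1f}")  # prints: 10.0 4.0
```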
Next, we run the same TPC-DS (Transaction Processing Performance Council Decision Support) benchmark with the Amazon EMR runtime for Apache Spark on an Amazon EMR cluster.
Most of the steps mirror the open-source run, differing only in a handful of Iceberg-related details.
Prerequisites
Complete the following prerequisite steps:
- Run
aws configure
to configure the AWS CLI shell to point to the benchmarking AWS account. Refer to for instructions.
- Upload the benchmark application JAR file to Amazon S3.
Deploy the EMR cluster and run the benchmark job on it.
To run the benchmark job, complete the following steps:
- Spin up a new EMR cluster with the AWS CLI; for example (the key pair and subnet values are placeholders):
aws emr create-cluster --name MyCluster --release-label emr-7.1.0 --applications Name=Spark --instance-type r5d.4xlarge --instance-count 9 --use-default-roles --ec2-attributes KeyName=my-keypair,SubnetId=subnet-12345678 --auto-terminate
Be sure to enable Iceberg. See for more details. Choose an Amazon EMR release that suits your workloads, and use the same root volume size and resource configuration as the open-source Flintrock setup. Refer to for a thorough understanding of the available AWS CLI options.
- Note the cluster ID from the response; you will use it in the next step.
- Submit the benchmark job on Amazon EMR using
add-steps
from the AWS CLI:
- Use the cluster ID obtained in Step 2.
- The benchmark application is at
s3:///spark-benchmark-assembly-3.5.1.jar
. - Use the correct warehouse location and database for the chosen Iceberg catalog. This should be identical to the one used for the open-source TPC-DS benchmark run.
- The results will be in
s3:///benchmark_run
.
Summarize the results
After the step is complete, you can view a concise summary of the benchmark's results at s3:///benchmark_run/timestamp=xxxx/summary.csv/xxx.csv
As in the previous run, we calculate the arithmetic mean and geometric mean of the query runtimes.
Clean up
To avoid incurring future charges, promptly delete the resources you created by following the instructions provided in the relevant documentation.
Summary
Amazon EMR continually enhances the EMR runtime for Spark when used with Iceberg tables, achieving performance 2.7 times faster than open-source Spark 3.5.1 and Iceberg 1.5.2 on TPC-DS 3 TB v2.13. To benefit from ongoing efficiency improvements, we recommend staying current with the latest Amazon EMR releases.
To stay informed, consider subscribing to the AWS Big Data Blog, where you'll find updates on the EMR runtime for Spark and Iceberg, along with configuration best practices and tuning recommendations.
About the authors
Serves as a Software Development Engineer for Amazon EMR at Amazon Web Services.
Serves as an Engineering Manager for Amazon EMR at Amazon Web Services.