Tuesday, April 1, 2025

The Amazon EMR 7.5 runtime significantly accelerates Apache Spark and Iceberg workloads, running up to 3.6 times faster than open-source Spark 3.5.3 with Iceberg 1.6.1.

Amazon EMR provides a high-performance runtime that is 100% API compatible with open-source Apache Spark and supports the Apache Iceberg table format, so existing applications run without changes. Amazon EMR on Amazon EC2, Amazon EMR on Amazon EKS, Amazon EMR Serverless, and Amazon EMR on AWS Outposts all use this optimized runtime.

This post showcases the performance benefits of the EMR 7.5 runtime for Spark and Iceberg, compared with open-source Spark 3.5.3 and Iceberg 1.6.1 tables, on the TPC-DS 3 TB benchmark v2.13.

The Apache Iceberg project provides a popular open-source table format for storing and processing large analytical datasets efficiently. Our benchmarks demonstrate a substantial performance boost: a 3.6-fold speedup on the TPC-DS 3 TB workload, reducing runtime from 1 hour and 31 minutes to just under 27 minutes. Cost effectiveness improves by 2.9 times, with the total cost decreasing from $16.00 to $5.39 on Amazon EC2 On-Demand r5d.4xlarge instances, a tangible benefit for data-intensive workloads.

This is an additional 32% improvement over Amazon EMR 7.1, which already included the optimizations described in our earlier post. Since that release, we have added eight more DataSource V2 optimizations for Spark in the EMR runtime.

Building on that DataSource V2 support, we have further tuned Spark operators beyond Amazon EMR 7.1, yielding a notable performance boost.

Benchmarks reveal that Amazon EMR 7.5 outperforms open-source alternatives, showcasing a 2.7x speedup over Apache Spark 3.5.3 in data processing tasks, with an average run time of 11 minutes compared to 30 minutes for Spark. Furthermore, EMR 7.5 demonstrates a 4.8x gain over Iceberg 1.6.1 in query performance, executing complex queries in just 2.5 hours as opposed to the 12 hours required by Iceberg.

We assessed the Spark engine's performance with the Iceberg table format using the TPC-DS dataset; because of setup differences, our results are not directly comparable to official TPC-DS benchmark results. The benchmarks compare the Amazon EMR runtime for Apache Spark and Apache Iceberg on Amazon EMR 7.5 EC2 clusters against open-source Spark 3.5.3 and Iceberg 1.6.1 running on EC2 instances.

All setup instructions and technical details are available in the accompanying GitHub repository. To mitigate the impact of external catalogs such as Hive, we used the Hadoop catalog for our Iceberg tables, which relies on the underlying file system, in this case Amazon S3, as the catalog. This setup is defined by setting the property spark.sql.catalog.<catalog_name>.type to hadoop. The fact tables use the default partitioning by the date column, with the number of partitions ranging from 200 to 2,100. No precomputed statistics were used for these tables.
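For example, the following Spark configuration properties (a minimal sketch using a placeholder catalog name my_catalog and a placeholder warehouse path) define a Hadoop catalog backed by Amazon S3; the full benchmark commands later in this post use the same properties:

--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
--conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog
--conf spark.sql.catalog.my_catalog.type=hadoop
--conf spark.sql.catalog.my_catalog.warehouse=s3://<bucket>/<warehouse_path>/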

We ran a total of 104 SparkSQL queries in three consecutive rounds and tracked the average execution time per query for comparison. Amazon EMR 7.5 with Iceberg enabled achieved an average runtime of 0.42 hours across the three rounds, a 3.6-fold speedup over open-source Spark 3.5.3 with Iceberg 1.6.1. The following figure shows the total runtimes in seconds.

EMR vs OSS runtime

The following table summarizes the metrics.

Metric | Amazon EMR 7.5 | Amazon EMR 7.1 | Open-source Spark 3.5.3 with Iceberg 1.6.1
Average runtime in seconds | 1535.62 | 2033.17 | 5546.16
Geometric mean over queries in seconds | 8.30046 | 10.13153 | 20.40555
Cost* | $5.39 | $7.18 | $16.00

*Detailed cost estimates are discussed later in this post.

The following chart illustrates the per-query performance improvements of Amazon EMR 7.5 relative to open-source Spark 3.5.3 and Iceberg 1.6.1. Speedups vary from modest to substantial across queries; the largest is for q93, which runs up to 9.4 times faster on Amazon EMR than on open-source Spark with Iceberg tables. The horizontal axis lists the 3 TB TPC-DS benchmark queries sorted by the improvement achieved with Amazon EMR, and the vertical axis shows the improvement as a speedup ratio.

EMR vs OSS per-query speedup

Cost comparison

Our benchmark reports the total runtime and the geometric mean of query times, which together measure Spark and Iceberg performance in a realistic, large-scale scenario. To provide more insight, we also analyze cost. We estimate cost using the following formulas, which account for EC2 On-Demand instance pricing, Amazon EBS, and Amazon EMR pricing.

  • Amazon EC2 cost = number of instances * r5d.4xlarge hourly rate * job runtime in hours
    • The r5d.4xlarge hourly rate in the US East (N. Virginia) Region is $1.152.
  • Root Amazon EBS cost = number of instances * Amazon EBS per-GB hourly price * root EBS volume size in GB * job runtime in hours
  • Amazon EMR cost = number of instances * Amazon EMR hourly uplift for r5d.4xlarge * job runtime in hours
    • The Amazon EMR hourly uplift for an r5d.4xlarge instance is approximately $0.27.
  • Total cost = Amazon EC2 cost + root Amazon EBS cost + Amazon EMR cost
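As a worked example, applying these formulas to the open-source run (nine instances and a 1.54-hour runtime) reproduces the totals in the cost table below; the EBS figure assumes an On-Demand gp2 rate of roughly $0.10 per GB-month (about $0.000137 per GB-hour), which is an assumption rather than a figure from the benchmark setup:

Amazon EC2 cost = 9 x $1.152 x 1.54 hours = $15.97
Root Amazon EBS cost = 9 x 20 GB x $0.000137 x 1.54 hours ≈ $0.04
Amazon EMR cost = $0 (no EMR charge for open-source Spark on EC2)
Total cost ≈ $16.01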

These calculations show that the Amazon EMR 7.5 benchmark run is 2.9 times more cost effective than the same benchmark job on open-source Spark 3.5.3 with Iceberg 1.6.1.

Metric | Amazon EMR 7.5 | Amazon EMR 7.1 | Open-source Spark 3.5.3 with Iceberg 1.6.1
Runtime in hours | 0.426 | 0.564 | 1.540
Number of EC2 instances (includes primary node) | 9 | 9 | 9
Amazon EBS size | 20 GB | 20 GB | 20 GB
Amazon EC2 cost (total runtime cost) | $4.35 | $5.81 | $15.97
Amazon EBS cost | $0.01 | $0.01 | $0.04
Amazon EMR cost | $1.02 | $1.36 | $0
Total cost | $5.38 | $7.18 | $16.01
Cost savings | Amazon EMR 7.5 is approximately 2.9 times more cost effective | Amazon EMR 7.1 is approximately 2.2 times more cost effective | Baseline

According to the Spark event logs, Amazon EMR read approximately 3.4 times fewer records from Amazon S3 and processed 4.1 times less data than the open-source version on the TPC-DS 3 TB benchmark. This reduction in Amazon S3 data scanning further lowers the cost of Amazon EMR workloads.
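To reproduce this comparison yourself, you can aggregate the task-level input metrics from the Spark event log. The following is a minimal sketch that assumes an uncompressed JSON event log file named eventlog.json; the field names match what Spark writes for task end events, but verify them against your Spark version:

jq -r 'select(.Event == "SparkListenerTaskEnd") | ."Task Metrics"."Input Metrics"."Bytes Read"' eventlog.json | awk '{ total += $1 } END { printf "total bytes read: %.0f\n", total }'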

We ran the Spark SQL performance benchmark for Apache Iceberg tables with an open-source Spark benchmark application. The suite exercises Spark SQL operations such as scans, joins, and aggregations against Iceberg tables.

To compare open-source Spark 3.5.3 and Amazon EMR 7.5 on an Iceberg workload, we deployed separate EC2 clusters, each consisting of nine r5d.4xlarge instances. The primary node has 16 vCPU and 128 GB of memory, and the eight worker nodes have a total of 128 vCPU and 1,024 GB of memory. To reflect a typical user experience, we ran the tests with the default Amazon EMR settings and made only the minimal Spark and Iceberg configuration changes needed for the benchmark.

The following table summarizes the Amazon EC2 configuration of the primary node and the eight worker nodes, all of type r5d.4xlarge.

vCPU | Memory (GiB) | Instance storage (GB) | EBS root volume (GB)
16 | 128 | 2 x 300 NVMe SSD | 20

Prerequisites

Complete the following prerequisite steps to run the benchmark:

  1. Download and set up the TPC-DS source data following the provided instructions, storing it in your Amazon S3 bucket and keeping a local copy of the tpcds-kit tools on the cluster.
  2. Build the benchmark application by following the provided instructions and copy it to your Amazon S3 bucket. Alternatively, copy a prebuilt benchmark application JAR directly to your S3 bucket.
  3. Create the Iceberg tables using the Hadoop catalog. For example, the following code uses an Amazon EMR 7.5 cluster to create the tables.
aws emr add-steps  --cluster-id <cluster-id> --steps Type=Spark,Name="Create Iceberg Tables", Args=[--class,com.amazonaws.eks.tpcds.CreateIcebergTables,--conf,spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions, --conf,spark.sql.catalog.hadoop_catalog=org.apache.iceberg.spark.SparkCatalog, --conf,spark.sql.catalog.hadoop_catalog.type=hadoop, --conf,spark.sql.catalog.hadoop_catalog.warehouse=s3://<bucket>/<warehouse_path>/, --conf,spark.sql.catalog.hadoop_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO, s3://<bucket>/<jar_location>/spark-benchmark-assembly-3.5.3.jar,s3://blogpost-sparkoneks-us-east-1/blog/BLOG_TPCDS-TEST-3T-partitioned/, /home/hadoop/tpcds-kit/tools,parquet,3000,true,<database_name>,true,true],ActionOnFailure=CONTINUE --region <AWS region>

Note the Hadoop catalog warehouse location and database name from the preceding step. We use the same Iceberg tables to run the benchmark with both Amazon EMR 7.5 and open-source Apache Spark.

The benchmark application is built from a specific branch of the source code. If you are building a new benchmark application, switch to the correct branch after downloading the source code from the GitHub repository.

Create the open-source Spark cluster on Amazon EC2 using Flintrock

To keep the setup comparable to Amazon EMR on Amazon EC2, follow the Flintrock instructions to create a Spark cluster with Hadoop YARN on EC2 with eight worker nodes.
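For reference, a Flintrock launch command along the following lines creates such a cluster; the cluster name, AMI, key pair, and Region are placeholders, and the flag names come from Flintrock's CLI, so verify them with flintrock launch --help for your version. The Hadoop YARN configuration (such as the yarn-site.xml update described next) is applied afterward.

flintrock launch tpcds-oss-benchmark --num-slaves 8 --install-hdfs --install-spark --spark-version 3.5.3 --ec2-instance-type r5d.4xlarge --ec2-region <aws-region> --ec2-key-name <your-key-pair> --ec2-identity-file <path-to-key.pem> --ec2-ami <amazon-linux-ami-id> --ec2-user ec2-user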

The open-source Spark cluster used the configurations provided in the repository. Replace <private ip of primary node> in the yarn-site.xml file with the IP address of your Flintrock cluster's primary node.

Run the TPC-DS benchmark with open-source Spark 3.5.3 and Iceberg 1.6.1

Complete the following steps to run the TPC-DS benchmark:

  1. Log in to the open-source cluster's primary node with flintrock login $CLUSTER_NAME.
  2. Submit your Spark job:
    1. Set the Iceberg catalog warehouse location and database to the ones containing the Iceberg tables created earlier. Use the same warehouse path and database as in the table-creation step.

    2. The results are written to s3://<YOUR_S3_BUCKET>/benchmark_run.
    3. You can monitor progress in /media/ephemeral0/spark_run.log.
spark-submit  --master yarn  --deploy-mode client  --class com.amazonaws.eks.tpcds.BenchmarkSQL  --conf spark.driver.cores=4  --conf spark.driver.memory=10g  --conf spark.executor.cores=16  --conf spark.executor.memory=100g  --conf spark.executor.instances=8  --conf spark.network.timeout=2000  --conf spark.executor.heartbeatInterval=300s  --conf spark.dynamicAllocation.enabled=false  --conf spark.shuffle.service.enabled=false  --conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.InstanceProfileCredentialsProvider  --conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem  --conf spark.jars.packages=org.apache.hadoop:hadoop-aws:3.3.4,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1,org.apache.iceberg:iceberg-aws-bundle:1.6.1  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions  --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog  --conf spark.sql.catalog.local.type=hadoop  --conf spark.sql.catalog.local.warehouse=s3a://<YOUR_S3_BUCKET>/<warehouse_path>/  --conf spark.sql.defaultCatalog=local  --conf spark.sql.catalog.local.io-impl=org.apache.iceberg.aws.s3.S3FileIO  spark-benchmark-assembly-3.5.3.jar  s3://<YOUR_S3_BUCKET>/benchmark_run 3000 1 false  'q1-v2.13,q10-v2.13,q11-v2.13,q12-v2.13,q13-v2.13,q14a-v2.13,q14b-v2.13,q15-v2.13,q16-v2.13,q17-v2.13,q18-v2.13,q19-v2.13,q2-v2.13,q20-v2.13,q21-v2.13,q22-v2.13,q23a-v2.13,q23b-v2.13,q24a-v2.13,q24b-v2.13,q25-v2.13,q26-v2.13,q27-v2.13,q28-v2.13,q29-v2.13,q3-v2.13,q30-v2.13,q31-v2.13,q32-v2.13,q33-v2.13,q34-v2.13,q35-v2.13,q36-v2.13,q37-v2.13,q38-v2.13,q39a-v2.13,q39b-v2.13,q4-v2.13,q40-v2.13,q41-v2.13,q42-v2.13,q43-v2.13,q44-v2.13,q45-v2.13,q46-v2.13,q47-v2.13,q48-v2.13,q49-v2.13,q5-v2.13,q50-v2.13,q51-v2.13,q52-v2.13,q53-v2.13,q54-v2.13,q55-v2.13,q56-v2.13,q57-v2.13,q58-v2.13,q59-v2.13,q6-v2.13,q60-v2.13,q61-v2.13,q62-v2.13,q63-v2.13,q64-v2.13,q65-v2.13,q66-v2.13,q67-v2.13,q68-v2.13,q69-v2.13,q7-v2.13,q70-v2.13,q71-v2.13,q72-v2.13,q73-v2.13,q74-v2.13,q75-v2.13,q76-v2.13,q77-v2.13,q78-v2.13,q79-v2.13,q8-v2.13,q80-v2.13,q81-v2.13,q82-v2.13,q83-v2.13,q84-v2.13,q85-v2.13,q86-v2.13,q87-v2.13,q88-v2.13,q89-v2.13,q9-v2.13,q90-v2.13,q91-v2.13,q92-v2.13,q93-v2.13,q94-v2.13,q95-v2.13,q96-v2.13,q97-v2.13,q98-v2.13,q99-v2.13,ss_max-v2.13'  true <database> > /media/ephemeral0/spark_run.log 2>&1 &

Summarize the results

After the Spark job completes, retrieve the results file from the output S3 bucket at s3://<YOUR_S3_BUCKET>/benchmark_run/timestamp=xxxx/summary.csv/xxx.csv. You can download it from the Amazon S3 console or with the AWS CLI (see the example command after the column list). The Spark benchmark application creates a timestamped directory and places a summary file in a folder named summary.csv. The output CSV file contains the following columns, without headers:

  • Query name
  • Median time
  • Minimum time
  • Maximum time
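For example, the following AWS CLI command (with placeholder bucket and local paths) downloads the summary files for review:

aws s3 cp s3://<YOUR_S3_BUCKET>/benchmark_run/ ./results/ --recursive --exclude "*" --include "*summary.csv/*"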

Because we performed three runs with one iteration each, we calculate the average and geometric mean of the benchmark runtimes across the three runs.
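For a single run, you can derive these figures from the downloaded summary file. The following sketch assumes the second CSV column holds each query's median runtime in seconds; it sums the per-query times and computes their geometric mean, and averaging those values across the three runs gives the reported metrics:

awk -F',' '{ sum += $2; logsum += log($2); n++ } END { printf "total runtime = %.2f s, per-query geometric mean = %.2f s\n", sum, exp(logsum/n) }' <summary-file>.csv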

Run the TPC-DS benchmark with the EMR runtime for Spark and Iceberg

Most of the steps are similar to the open-source benchmark setup, with a few Iceberg-specific details.

Prerequisites

Complete the following prerequisite steps:

  1. Run aws configure to configure the AWS CLI with the benchmarking AWS account as its default profile. Refer to the AWS CLI documentation for instructions.
  2. Upload the benchmark application JAR file to Amazon S3, as shown in the example command that follows.
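For example, the following AWS CLI command copies the benchmark JAR referenced later in this post to your bucket (the bucket name is a placeholder):

aws s3 cp spark-benchmark-assembly-3.5.3.jar s3://<your-bucket>/spark-benchmark-assembly-3.5.3.jar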

Create and launch the EMR cluster and run the benchmark job

Complete the following steps to execute the benchmark job:

  1. Create an Amazon EMR 7.5 cluster with Iceberg enabled, using the same Amazon EMR version, root volume size, and resource configuration as the open-source Flintrock setup (nine r5d.4xlarge instances). For example (adjust the key pair, roles, and Region to your environment): aws emr create-cluster --name "emr-iceberg-benchmark" --release-label emr-7.5.0 --applications Name=Spark --instance-type r5d.4xlarge --instance-count 9 --use-default-roles --ebs-root-volume-size 20 --ec2-attributes KeyName=<your-key-pair> --configurations '[{"Classification":"iceberg-defaults","Properties":{"iceberg.enabled":"true"}}]' --region <aws-region>. See the Amazon EMR documentation on enabling Iceberg for more details.
  2. Note the cluster ID from the response; you need it to submit the benchmark job.
  3. Submit the benchmark job using aws emr add-steps from the AWS CLI:
    1. Replace <cluster-id> with the cluster ID from Step 2.
    2. The benchmark application JAR is at s3://<your-bucket>/spark-benchmark-assembly-3.5.3.jar.
    3. Set the Iceberg catalog warehouse location and database to the ones that contain the created Iceberg tables. This must be the same warehouse path and database used in the open-source TPC-DS benchmark run.

    4. The results are written to s3://<your-bucket>/benchmark_run.
aws emr add-steps   --cluster-id <cluster-id> --steps Type=Spark,Name="SPARK Iceberg EMR TPCDS Benchmark Job", Args=[--class,com.amazonaws.eks.tpcds.BenchmarkSQL, --conf,spark.driver.cores=4, --conf,spark.driver.memory=10g, --conf,spark.executor.cores=16, --conf,spark.executor.memory=100g, --conf,spark.executor.instances=8, --conf,spark.network.timeout=2000, --conf,spark.executor.heartbeatInterval=300s, --conf,spark.dynamicAllocation.enabled=false, --conf,spark.shuffle.service.enabled=false, --conf,spark.sql.iceberg.data-prefetch.enabled=true, --conf,spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions, --conf,spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog, --conf,spark.sql.catalog.local.type=hadoop, --conf,spark.sql.catalog.local.warehouse=s3://<your-bucket>/<warehouse-path>, --conf,spark.sql.defaultCatalog=local, --conf,spark.sql.catalog.local.io-impl=org.apache.iceberg.aws.s3.S3FileIO, s3://<your-bucket>/spark-benchmark-assembly-3.5.3.jar, s3://<your-bucket>/benchmark_run,3000,1,false, 'q1-v2.13,q10-v2.13,q11-v2.13,q12-v2.13,q13-v2.13,q14a-v2.13,q14b-v2.13,q15-v2.13,q16-v2.13,q17-v2.13,q18-v2.13,q19-v2.13,q2-v2.13,q20-v2.13,q21-v2.13,q22-v2.13,q23a-v2.13,q23b-v2.13,q24a-v2.13,q24b-v2.13,q25-v2.13,q26-v2.13,q27-v2.13,q28-v2.13,q29-v2.13,q3-v2.13,q30-v2.13,q31-v2.13,q32-v2.13,q33-v2.13,q34-v2.13,q35-v2.13,q36-v2.13,q37-v2.13,q38-v2.13,q39a-v2.13,q39b-v2.13,q4-v2.13,q40-v2.13,q41-v2.13,q42-v2.13,q43-v2.13,q44-v2.13,q45-v2.13,q46-v2.13,q47-v2.13,q48-v2.13,q49-v2.13,q5-v2.13,q50-v2.13,q51-v2.13,q52-v2.13,q53-v2.13,q54-v2.13,q55-v2.13,q56-v2.13,q57-v2.13,q58-v2.13,q59-v2.13,q6-v2.13,q60-v2.13,q61-v2.13,q62-v2.13,q63-v2.13,q64-v2.13,q65-v2.13,q66-v2.13,q67-v2.13,q68-v2.13,q69-v2.13,q7-v2.13,q70-v2.13,q71-v2.13,q72-v2.13,q73-v2.13,q74-v2.13,q75-v2.13,q76-v2.13,q77-v2.13,q78-v2.13,q79-v2.13,q8-v2.13,q80-v2.13,q81-v2.13,q82-v2.13,q83-v2.13,q84-v2.13,q85-v2.13,q86-v2.13,q87-v2.13,q88-v2.13,q89-v2.13,q9-v2.13,q90-v2.13,q91-v2.13,q92-v2.13,q93-v2.13,q94-v2.13,q95-v2.13,q96-v2.13,q97-v2.13,q98-v2.13,q99-v2.13,ss_max-v2.13', true,<database>],ActionOnFailure=CONTINUE --region <aws-region>

Summarize the results

After the step is complete, you can find the benchmark result summary at s3://<YOUR_S3_BUCKET>/benchmark_run/timestamp=xxxx/summary.csv/xxx.csv. As with the open-source runs, calculate the average and geometric mean of the query execution times.

Clean up

To avoid incurring future charges, delete the resources you created by following the cleanup instructions in the repository's README file.

Summary

Amazon EMR continually improves the EMR runtime for Spark when used with Iceberg tables. On the TPC-DS 3 TB v2.13 benchmark, Amazon EMR 7.5 is approximately 3.6 times faster than open-source Spark 3.5.3 with Iceberg 1.6.1, an improvement of approximately 32% over Amazon EMR 7.1. We recommend staying up to date with the latest Amazon EMR releases to benefit from ongoing performance improvements.

To stay current, subscribe to the AWS Big Data Blog for updates on the EMR runtime for Spark and Iceberg, as well as guidance on configuration best practices and tuning tips.


About the Authors

Serves as a Software Development Engineer for Amazon EMR at Amazon Web Services.

Serves as an Engineering Manager for EMR at Amazon Web Services.
