Tuesday, April 1, 2025

Accelerate query performance with Apache Iceberg statistics on the AWS Glue Data Catalog

Today, we're pleased to announce a new capability for the AWS Glue Data Catalog: generating column-level aggregation statistics for Apache Iceberg tables to accelerate queries. These statistics are used by the cost-based optimizer (CBO) in Amazon Redshift Spectrum, resulting in improved query performance and potential cost savings.

Apache Iceberg is an open table format that provides the capability of ACID transactions on your data lakes. It's designed to process large analytics datasets and is efficient for even small row-level operations. It also enables useful features such as time travel, schema evolution, hidden partitioning, and more.

AWS has invested in service integration with Iceberg to enable Iceberg workloads based on customer feedback. One example is the AWS Glue Data Catalog. The Data Catalog is a centralized repository that stores metadata about your organization's datasets, making the data visible, searchable, and queryable for users. The Data Catalog supports Iceberg tables and tracks the table's current metadata. It also allows automatic compaction of the individual small files produced by each transactional write on tables into a few large files for faster read and scan operations.

In 2023, the Data Catalog announced support for column-level statistics for non-Iceberg tables. That feature collects table statistics used by the query engine's CBO. Now, the Data Catalog expands this support to Iceberg tables. The Iceberg table column statistics that the Data Catalog generates are based on the Puffin spec and stored on Amazon Simple Storage Service (Amazon S3) with other table data. This way, various engines supporting Iceberg can utilize and update them.

This post demonstrates how column-level statistics for Iceberg tables work with Redshift Spectrum. Additionally, we showcase the performance benefit of the Iceberg column statistics with the TPC-DS dataset.

How Iceberg table column statistics work

The AWS Glue Data Catalog generates table column statistics using the Theta Sketch algorithm from Apache DataSketches to estimate the number of distinct values (NDV) and stores them in a Puffin file.

For SQL planners, NDV is an important statistic for optimizing query planning. There are several scenarios where NDV statistics can potentially optimize query performance. For example, when joining two tables on a column, the optimizer can use the NDV to estimate the selectivity of the join. If one table has a low NDV for the join column compared to the other table, the optimizer may choose a broadcast join instead of a shuffle join, reducing data movement and improving query performance. Moreover, when more than two tables are joined, the optimizer can estimate the output size of each join and plan an efficient join order. Additionally, NDV can be used for various optimizations such as GROUP BY, DISTINCT, and COUNT queries.

However, continually calculating NDV with 100% accuracy requires O(N) space complexity. Instead, Theta Sketch is an efficient algorithm that lets you estimate the NDV in a dataset without needing to store all the distinct values in memory and storage. The key idea behind Theta Sketch is to hash the data into a range between 0–1, and then select only a small portion of the hashed values based on a threshold (denoted as θ). By analyzing this small subset of data, the Theta Sketch algorithm can provide an accurate estimate of the NDV in the original dataset.
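To make this concrete, the following minimal sketch (not part of the post's workflow) uses the Apache DataSketches Python bindings to estimate the NDV of a synthetic column. The data and the default sketch parameters are illustrative assumptions.

# Minimal illustration: estimate NDV with a Theta Sketch instead of storing
# every distinct value. Requires the datasketches package (pip install datasketches).
import random

from datasketches import update_theta_sketch

sketch = update_theta_sketch()  # default lg_k trades accuracy for sketch size

# Synthetic "column" with roughly 100,000 distinct values across 1,000,000 rows
for _ in range(1_000_000):
    sketch.update(f"customer-{random.randint(0, 99_999)}")

print(f"Estimated NDV: {sketch.get_estimate():.0f}")
print(f"Bounds at 2 standard deviations: "
      f"[{sketch.get_lower_bound(2):.0f}, {sketch.get_upper_bound(2):.0f}]")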

Iceberg's Puffin file is designed to store information such as indexes and statistics as blobs. One of the representative blob types that can be stored is apache-datasketches-theta-v1, which contains serialized values for estimating the NDV using the Theta Sketch algorithm. Puffin files are linked to a snapshot-id in Iceberg's metadata and are used by the query engine's CBO to optimize query plans.
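If you want to see where a table's Puffin files are referenced, one way is to read the table's current Iceberg metadata file. The following sketch assumes the Glue table carries the usual metadata_location parameter and that the metadata JSON lists statistics files under a statistics field, per the Iceberg table spec; the database and table names are placeholders.

# Sketch: locate the Puffin statistics files referenced by an Iceberg table's metadata.
import json
from urllib.parse import urlparse

import boto3

glue = boto3.client("glue")
s3 = boto3.client("s3")

# Placeholder database/table names
table = glue.get_table(DatabaseName="tpcdsdbwithstats", Name="call_center")["Table"]
metadata_location = table["Parameters"]["metadata_location"]

parsed = urlparse(metadata_location)  # s3://bucket/path/to/metadata.json
metadata = json.loads(
    s3.get_object(Bucket=parsed.netloc, Key=parsed.path.lstrip("/"))["Body"].read()
)

# Each entry links a Puffin file to the snapshot it describes
for stats_file in metadata.get("statistics", []):
    print(stats_file["snapshot-id"], stats_file["statistics-path"])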

Leverage Iceberg column statistics through Amazon Redshift

To demonstrate the performance benefit of this capability, we use the industry-standard TPC-DS 3 TB dataset. We compare the query performance with and without Iceberg column statistics for the tables by running queries in Redshift Spectrum. We have included the queries used in this post, and we recommend trying your own queries by following the workflow.

The following are the overall steps:

  1. Run an AWS Glue job that extracts the TPC-DS dataset from the public Amazon S3 bucket and saves it as Iceberg tables in your S3 bucket. The AWS Glue Data Catalog stores these tables' metadata location. Query these tables using Amazon Redshift Spectrum.
  2. Generate column statistics: Use the enhanced capabilities of the AWS Glue Data Catalog to generate column statistics for each table. This produces Puffin files storing Theta Sketches.
  3. Query with Amazon Redshift Spectrum: Evaluate the performance benefit of column statistics on query performance by using Amazon Redshift Spectrum to run queries on the dataset.

The following diagram illustrates the architecture.


To try this new capability, we complete the following steps:

  1. Set up resources with AWS CloudFormation.
  2. Run an AWS Glue job to create Iceberg tables for the 3 TB TPC-DS dataset in your S3 bucket. The Data Catalog stores these tables' metadata location.
  3. Run queries on Redshift Spectrum and note the query duration.
  4. Generate Iceberg column statistics for Data Catalog tables.
  5. Run queries on Redshift Spectrum and compare the query duration with the previous run.
  6. Optionally, schedule AWS Glue column statistics jobs using AWS Lambda and an Amazon EventBridge schedule.

Set up resources with AWS CloudFormation

This post includes a CloudFormation template for a quick setup. You can review and customize it to suit your needs. Note that this CloudFormation template requires a Region with at least 3 Availability Zones. The template generates the following resources:

  • A virtual private cloud (VPC), public subnet, private subnets, and route tables
  • An Amazon Redshift Serverless workgroup and namespace
  • An S3 bucket to store the TPC-DS dataset, column statistics, job scripts, and so on
  • Data Catalog databases
  • An AWS Glue job to extract the TPC-DS dataset from the public S3 bucket and save the data as Iceberg tables in your S3 bucket
  • AWS Identity and Access Management (IAM) roles and policies
  • A Lambda function and EventBridge schedule to run the AWS Glue column statistics on a schedule

To launch the CloudFormation stack, complete the following steps:

  1. Sign in to the AWS CloudFormation console.
  2. Choose Launch Stack.
  3. Choose Next.
  4. Leave the parameters as default or make appropriate changes based on your requirements, then choose Next.
  5. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
  6. Choose Create.

This stack can take around 10 minutes to complete, after which you can view the deployed stack on the AWS CloudFormation console.
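If you prefer to deploy from code rather than the console, a rough equivalent with the AWS SDK for Python could look like the following; the template URL and stack name are placeholders, because the post distributes the template through the Launch Stack button.

# Hypothetical sketch: launch the stack with boto3. TEMPLATE_URL is a placeholder
# for the template behind the Launch Stack button.
import boto3

cloudformation = boto3.client("cloudformation")

TEMPLATE_URL = "https://example-bucket.s3.amazonaws.com/iceberg-stats-blog.yaml"  # placeholder

cloudformation.create_stack(
    StackName="iceberg-stats-blog",  # placeholder stack name
    TemplateURL=TEMPLATE_URL,
    Capabilities=["CAPABILITY_NAMED_IAM"],  # the template creates IAM roles and policies
)

# Block until stack creation finishes (roughly 10 minutes)
cloudformation.get_waiter("stack_create_complete").wait(StackName="iceberg-stats-blog")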

Run an AWS Glue job to create Iceberg tables for the 3 TB TPC-DS dataset

When the CloudFormation stack creation is complete, run the AWS Glue job to create Iceberg tables for the TPC-DS dataset. This AWS Glue job extracts the TPC-DS dataset from the public S3 bucket and transforms the data into Iceberg tables. These tables are loaded into your S3 bucket and registered in the Data Catalog.

To run the AWS Glue job, complete the following steps:

  1. On the AWS Glue console, choose ETL jobs in the navigation pane.
  2. Choose InitialDataLoadJob-<your-stack-name>.
  3. Choose Run.

This AWS Glue job can take around 30 minutes to complete. The process is complete when the job processing status shows as Succeeded.
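You can also start and monitor the job from code instead of the console. The following sketch assumes a stack named iceberg-stats-blog, so the job name shown is a placeholder following the InitialDataLoadJob-<your-stack-name> pattern.

# Sketch: start the Glue job and poll until it reaches a terminal state.
import time

import boto3

glue = boto3.client("glue")
job_name = "InitialDataLoadJob-iceberg-stats-blog"  # replace with your stack's job name

run_id = glue.start_job_run(JobName=job_name)["JobRunId"]

while True:
    state = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]["JobRunState"]
    print(state)
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        break
    time.sleep(60)  # the job takes around 30 minutes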

The AWS Glue job creates tables storing the TPC-DS dataset in two identical databases: tpcdsdbnostats and tpcdsdbwithstats. The tables in tpcdsdbnostats will have no generated statistics, and we use them as a reference. We generate statistics on the tables in tpcdsdbwithstats. Confirm the creation of those two databases and their underlying tables on the AWS Glue console. At this point, these databases hold the same data and there are no statistics generated on the tables.

Run queries on Redshift Spectrum without statistics

In the previous steps, you set up a Redshift Serverless workgroup with the given RPU (128 by default), prepared the TPC-DS 3 TB dataset in your S3 bucket, and created Iceberg tables (which currently don't have statistics).

To run your query in Amazon Redshift, complete the following steps:

  1. Download the Amazon Redshift queries.
  2. In the Redshift query editor v2, run the queries listed in the Redshift Query for tables without column statistics section of the downloaded file redshift-tpcds-sample.sql.
  3. Note the query runtime of each query (or capture it programmatically, as in the sketch after this list).
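If you prefer to capture timings programmatically rather than from the query editor, the Redshift Data API can run the same statements and report their duration. The following sketch uses placeholder workgroup, database, and query values.

# Sketch: run a statement through the Redshift Data API and read back its runtime.
import time

import boto3

redshift_data = boto3.client("redshift-data")

resp = redshift_data.execute_statement(
    WorkgroupName="iceberg-stats-workgroup",  # placeholder Redshift Serverless workgroup
    Database="dev",                           # placeholder database
    Sql='select count(*) from "awsdatacatalog"."tpcdsdbnostats"."store_sales";',
)

while True:
    desc = redshift_data.describe_statement(Id=resp["Id"])
    if desc["Status"] in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(5)

# Duration is reported in nanoseconds
print(desc["Status"], f"{desc['Duration'] / 1e9:.1f} seconds")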

Generate Iceberg column statistics

To generate statistics on the Data Catalog tables, complete the following steps:

  1. On the AWS Glue console, choose Databases under Data Catalog in the navigation pane.
  2. Choose the tpcdsdbwithstats database to view all available tables.
  3. Select any of these tables (for example, call_center).
  4. Go to Column statistics – new and choose Generate statistics.
  5. Keep the default options:
    1. For Choose columns, select Table (All columns).
    2. For Row sampling options, select All rows.
    3. For IAM role, choose AWSGluestats-blog-<your-stack-name>.
  6. Choose Generate statistics.

You can see the status of the statistics generation run, as shown in the following screenshot.

After you generate the Iceberg table column statistics, you should be able to see detailed column statistics for that table.

Following the statistics generation, you can find an <id>.stat file in the AWS Glue table's underlying data location in Amazon S3. This file is a Puffin file that stores the Theta Sketch data structure. Query engines can use this Theta Sketch to efficiently estimate the NDV when operating on the table, which helps optimize query performance.

Repeat the previous steps to generate statistics for all tables, such as catalog_sales, catalog_returns, warehouse, item, date_dim, store_sales, customer, customer_address, web_sales, time_dim, ship_mode, web_site, and web_returns. Alternatively, you can manually run the Lambda function that instructs AWS Glue to generate column statistics for all tables (we discuss the details of this function later in this post), or call the same API yourself, as in the sketch following this paragraph.
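The following sketch shows what such calls look like for the remaining tables, using the same Glue API the console invokes; the IAM role name is a placeholder following the AWSGluestats-blog-<your-stack-name> pattern, and depending on your account quotas you may need to stagger the runs.

# Sketch: start a column statistics task run for each remaining table.
import boto3

glue = boto3.client("glue")

TABLES = [
    "catalog_sales", "catalog_returns", "warehouse", "item", "date_dim",
    "store_sales", "customer", "customer_address", "web_sales", "time_dim",
    "ship_mode", "web_site", "web_returns",
]

for table_name in TABLES:
    run = glue.start_column_statistics_task_run(
        DatabaseName="tpcdsdbwithstats",
        TableName=table_name,
        Role="AWSGluestats-blog-<your-stack-name>",  # replace with your stack's role
    )
    print(table_name, run["ColumnStatisticsTaskRunId"])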

After you generate statistics for all tables, you can assess the query performance for each query.

Run queries on Redshift Spectrum with statistics

In the previous steps, you set up a Redshift Serverless workgroup with the given RPU (128 by default), prepared the TPC-DS 3 TB dataset in your S3 bucket, and created Iceberg tables with column statistics.

To run the provided queries using Redshift Spectrum on the tables with statistics, complete the following steps:

  1. In the Redshift query editor v2, run the queries listed in the Redshift Query for tables with column statistics section of the downloaded file redshift-tpcds-sample.sql.
  2. Note the query runtime of each query.

With Redshift Serverless 128 RPU and the TPC-DS 3 TB dataset, we conducted sample runs for 10 selected TPC-DS queries where NDV information was expected to be helpful. We ran each query 10 times. The results shown in the following table are sorted by the percentage of the performance improvement for the queries with column statistics.

TPC-DS 3 TB Queries | Without Column Statistics (seconds) | With Column Statistics (seconds) | Performance Improvement (%)
Query 16 | 305.0284 | 51.7807 | 489.1
Query 75 | 398.0643 | 110.8366 | 259.1
Query 78 | 169.8358 | 52.8951 | 221.1
Query 95 | 35.2996 | 11.1047 | 217.9
Query 94 | 160.52 | 57.0321 | 181.5
Query 68 | 14.6517 | 7.4745 | 96
Query 4 | 217.8954 | 121.996 | 78.6
Query 72 | 123.8698 | 76.215 | 62.5
Query 29 | 22.0769 | 14.8697 | 48.5
Query 25 | 43.2164 | 32.8602 | 31.5

The results demonstrated clear performance benefits ranging from 31.5% to 489.1%.

To dive deep, let's explore query 16, which showed the highest performance benefit:

TPC-DS Query 16:

select
   count(distinct cs_order_number) as "order count",
   sum(cs_ext_ship_cost) as "total shipping cost",
   sum(cs_net_profit) as "total net profit"
from
   "awsdatacatalog"."tpcdsdbwithstats"."catalog_sales" cs1,
   "awsdatacatalog"."tpcdsdbwithstats"."date_dim",
   "awsdatacatalog"."tpcdsdbwithstats"."customer_address",
   "awsdatacatalog"."tpcdsdbwithstats"."call_center"
where
    d_date between '2000-2-01' and dateadd(day, 60, cast('2000-2-01' as date))
    and cs1.cs_ship_date_sk = d_date_sk
    and cs1.cs_ship_addr_sk = ca_address_sk
    and ca_state = 'AL'
    and cs1.cs_call_center_sk = cc_call_center_sk
    and cc_county in ('Dauphin County','Levy County','Luce County','Jackson County',
                      'Daviess County')
    and exists (select *
                from "awsdatacatalog"."tpcdsdbwithstats"."catalog_sales" cs2
                where cs1.cs_order_number = cs2.cs_order_number
                  and cs1.cs_warehouse_sk <> cs2.cs_warehouse_sk)
    and not exists (select *
                    from "awsdatacatalog"."tpcdsdbwithstats"."catalog_returns" cr1
                    where cs1.cs_order_number = cr1.cr_order_number)
order by count(distinct cs_order_number)
limit 100;

You can compare the difference between the query plans with and without column statistics by running the query with the EXPLAIN command.

The following screenshot shows the results without column statistics.

The following screenshot shows the results with column statistics.

You can observe some notable differences as a result of using column statistics. At a high level, the overall estimated cost of the query is significantly reduced, from 20633217995813352.00 to 331727324110.36.

The two query plans chose different join strategies.

The following is one line included in the query plan without column statistics:

XN Hash Join DS_DIST_BOTH (cost=45365031.50 rows=10764790749 width=44)
  Outer Dist Key: "outer".cs_order_number
  Inner Dist Key: volt_tt_61c54ae740984.cs_order_number
  Hash Cond: (("outer".cs_order_number = "inner".cs_order_number) AND ("outer".cs_warehouse_sk = "inner".cs_warehouse_sk))

The following is the corresponding line in the query plan with column statistics:

XN Hash Join DS_BCAST_INNER (cost=307193250965.64..327130154786.68 rows=17509398 width=32)
  Hash Cond: (("outer".cs_order_number = "inner".cs_order_number) AND ("outer".cs_warehouse_sk = "inner".cs_warehouse_sk))

The query plan for the table without column statistics used DS_DIST_BOTH when joining large tables, whereas the query plan for the table with column statistics chose DS_BCAST_INNER. The join order has also changed based on the column statistics. These join strategy and join order changes are primarily driven by more accurate join cardinality estimations, which are possible with column statistics, and they result in a more optimized query plan.

Schedule AWS Glue column statistics runs

Maintaining up-to-date column statistics is crucial for optimal query performance. This section guides you through automating the generation of Iceberg table column statistics using Lambda and EventBridge Scheduler. This automation keeps your column statistics up to date without manual intervention.

The required Lambda function and EventBridge schedule are already created by the CloudFormation template. The Lambda function is used to invoke the AWS Glue column statistics run. First, complete the following steps to explore how the Lambda function is configured:

  1. On the Lambda console, choose Functions in the navigation pane.
  2. Open the function GlueTableStatisticsFunctionv1.

For a clearer understanding of the Lambda function, we recommend reviewing the code in the Code section and examining the environment variables under Configuration.

The Lambda function invokes the start_column_statistics_task_run API through the AWS SDK for Python (Boto3) library.
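The exact code from the template isn't reproduced here, but a handler with that behavior can be sketched roughly as follows; the environment variable names are illustrative assumptions, not necessarily those used by the CloudFormation template.

# Rough sketch of a handler that starts a statistics run for every table in a database.
import os

import boto3

glue = boto3.client("glue")


def lambda_handler(event, context):
    database = os.environ["DATABASE_NAME"]  # illustrative variable names
    role = os.environ["STATS_ROLE"]

    run_ids = []
    for page in glue.get_paginator("get_tables").paginate(DatabaseName=database):
        for table in page["TableList"]:
            run = glue.start_column_statistics_task_run(
                DatabaseName=database,
                TableName=table["Name"],
                Role=role,
            )
            run_ids.append(run["ColumnStatisticsTaskRunId"])

    return {"startedRuns": run_ids}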

Next, complete the following steps to explore how the EventBridge schedule is configured:

  1. On the EventBridge console, choose Schedules under Scheduler in the navigation pane.
  2. Locate the schedule created by the CloudFormation template.

This page is where you manage and configure the schedules for your events. As shown in the following screenshot, the schedule is configured to invoke the Lambda function daily at a specific time (in this case, 08:27 PM UTC). This makes sure the AWS Glue column statistics run on a regular and predictable basis.
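If you want to recreate or adjust such a schedule outside of CloudFormation, the EventBridge Scheduler API accepts the same kind of cron expression. The following sketch mirrors the 08:27 PM UTC example; the schedule name and ARNs are placeholders, not the values created by the template.

# Sketch: create a daily schedule that invokes the statistics Lambda function.
import boto3

scheduler = boto3.client("scheduler")

scheduler.create_schedule(
    Name="glue-column-statistics-daily",       # placeholder name
    ScheduleExpression="cron(27 20 * * ? *)",  # every day at 20:27 UTC
    ScheduleExpressionTimezone="UTC",
    FlexibleTimeWindow={"Mode": "OFF"},
    Target={
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:GlueTableStatisticsFunctionv1",  # placeholder
        "RoleArn": "arn:aws:iam::123456789012:role/scheduler-invoke-lambda",                    # placeholder
    },
)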

Clean up

When you have finished all the above steps, remember to clean up the AWS resources you created using AWS CloudFormation:

  1. Delete the CloudFormation stack.
  2. Delete the S3 bucket storing the Iceberg tables for the TPC-DS dataset and the AWS Glue job script.

Conclusion

This post introduced a new feature in the Data Catalog that enables you to create Iceberg table column-level statistics. The Iceberg table stores a Theta Sketch, which can be used to estimate NDV efficiently, in a Puffin file. The Redshift Spectrum CBO can use that to optimize the query plan, resulting in improved query performance and potential cost savings.

Try out this new feature in the Data Catalog to generate column-level statistics and improve query performance, and let us know your feedback in the comments section. Visit the AWS Glue Data Catalog documentation to learn more.


About the Authors

Sotaro Hikita is a Solutions Architect. He supports customers in a wide range of industries, especially the financial industry, to build better solutions. He is particularly passionate about big data technologies and open source software.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his new road bike.

Kyle Duong is a Senior Software Development Engineer on the AWS Glue and AWS Lake Formation team. He is passionate about building big data technologies and distributed systems.

Kalaiselvi Kamaraj is a Senior Software Development Engineer with Amazon. She has worked on several projects within the Amazon Redshift query processing team and is currently focusing on performance-related projects for Redshift data lakes.

Sandeep Adwankar is a Senior Product Manager at AWS. Based in the California Bay Area, he works with customers around the globe to translate business and technical requirements into products that enable customers to improve how they manage, secure, and access data.
