Databases and query engines, including Amazon Redshift, typically rely on statistics about the underlying data, such as the number of distinct values and which values have low selectivity, to determine the most efficient way to execute a query. When Amazon Redshift receives a query, the query planner uses these statistics to make an educated guess about the best way to load and process data from storage. More statistics about the underlying data can often help a query planner choose a plan that leads to better query performance, but this comes with a tradeoff among the costs of computing, storing, and maintaining statistics, and can require additional query planning time.
Data lakes are a powerful architecture for organizing data for analytical processing, because they let developers use efficient analytical columnar formats like Apache Parquet, while letting them continue to modify the shape of their data as their applications evolve using open table formats like Apache Iceberg. One challenge with data lakes is that they don't always have statistics about their underlying data, making it difficult for query engines to determine the optimal execution path. This can lead to issues, including slow queries and unexpected changes in query performance.
In 2024, Amazon Redshift customers queried over 77 EB (exabytes) of data residing in data lakes. Given this usage, the Amazon Redshift team continues to innovate on data lake query performance to help customers efficiently access their open data and get near real-time insights for critical business decisions. In 2024, Amazon Redshift launched several features that improve query performance for data lakes, including faster query times when a data lake doesn't have statistics. With Amazon Redshift patch 190, the TPC-DS 3TB benchmark showed an overall 2x query performance improvement on Apache Iceberg tables without statistics, including TPC-DS Query #72, which improved by 125 times, from 690 seconds to 5.5 seconds.
In this post, we first briefly review how planner statistics are collected and what impact they have on queries. Then, we discuss Amazon Redshift features that deliver optimal plans on Iceberg tables and Parquet data even in the absence of statistics. Finally, we review some example queries that now execute faster because of these latest Amazon Redshift innovations.
Prerequisites
The benchmarks in this post were run using the following environment:
- Amazon Redshift Serverless with a base capacity of 88 RPU (Redshift processing units)
- The Cloud Data Warehouse Benchmark, derived from the TPC-DS 3TB dataset. The following tables in this dataset were partitioned (the rest were unpartitioned):
  - catalog_returns on cr_returned_date_sk
  - catalog_sales on cs_sold_date_sk
  - store_returns on sr_returned_date_sk
  - store_sales on ss_sold_date_sk
  - web_returns on wr_returned_date_sk
  - web_sales on ws_sold_date_sk
  - inventory on inv_date_sk
For more information on loading the Cloud Data Warehouse Benchmark into your Amazon Redshift Serverless workgroup, see the Cloud Data Warehouse Benchmark documentation.
Now, let's review how database statistics work and how they impact query performance.
Overview of the impact of planner statistics on query performance
To understand why database statistics are important, first let's review what a query planner does. A query planner is the brain of a database: when you send a query to a database, the query planner must determine the most efficient way to load and compute all the information required to answer the query. Having information about the underlying dataset, such as statistics about the number of rows or the distribution of values, helps the query planner generate an optimal plan for retrieving the data. Amazon Redshift uses statistics about the underlying data in tables and columns to determine how to build an optimal query execution path.
Let's see how this works in an example. Consider the following query to determine the top five sales dates in December 2024 for stores in North America:
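A minimal sketch of such a query, assuming hypothetical `stores` and `receipts` tables with the columns referenced in the discussion below:

```sql
-- Hypothetical schema: stores(store_id, region), receipts(store_id, insert_date, amount)
SELECT r.insert_date,
       SUM(r.amount) AS total_sales
FROM receipts r
JOIN stores s ON s.store_id = r.store_id
WHERE s.region = 'NAMER'
  AND r.insert_date BETWEEN '2024-12-01' AND '2024-12-31'
GROUP BY r.insert_date
ORDER BY total_sales DESC
LIMIT 5;
```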
In this query, the query planner has to consider several factors, including:
- Which table is larger, `stores` or `receipts`? Can I query the smaller table first to reduce the amount of searching on the larger table?
- Which predicate returns more rows, `receipts.insert_date BETWEEN '2024-12-01' AND '2024-12-31'` or `stores.region = 'NAMER'`?
- Is there any partitioning on the tables? Can I search over a smaller set of files to speed up the query?
Having information about the underlying data helps generate an optimal query plan. For example, `stores.region = 'NAMER'` might only return a few rows (that is, it's highly selective), meaning it's more efficient to execute that step of the query first before filtering through the `receipts` table. What helps a query planner make this decision is the statistics available on columns and tables.
Table statistics (also known as planner statistics) provide a snapshot of the data available in a table to help the query planner make an informed decision about execution strategies. Databases collect table statistics through sampling, which involves reviewing a subset of rows to determine the overall distribution of data. The quality of statistics, including the freshness of data, can significantly impact a query plan, which is why databases reanalyze and regenerate statistics after a certain threshold of the underlying data changes.
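For native tables, Amazon Redshift collects these statistics with the ANALYZE command, and the `stats_off` column of the SVV_TABLE_INFO system view reports how stale a table's statistics are (the table name below is illustrative):

```sql
-- Refresh planner statistics for a table (table name is illustrative)
ANALYZE receipts;

-- Check statistics staleness: stats_off is 0 when statistics are
-- current and grows as the underlying data changes
SELECT "table", stats_off
FROM svv_table_info
WHERE "table" = 'receipts';
```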
Amazon Redshift supports several table- and column-level statistics to assist in building query plans. These include:
Statistic | What it is | Impact | Query plan influence |
Number of rows (numrows) | Number of rows in a table | Estimates the overall size of query results and JOIN sizes | Decisions on JOIN ordering and algorithms, and resource allocation |
Number of distinct values (NDV) | Number of unique values in a column | Estimates selectivity, that is, how many rows will be returned by predicates (for example, a WHERE clause) and the size of JOIN results | Decisions on JOIN ordering and algorithms |
NULL count | Number of NULL values in a column | Estimates the number of rows eliminated by IS NULL or IS NOT NULL | Decisions on filter pushdown (that is, which nodes execute a query) and JOIN strategies |
Min/max values | Smallest and largest values in a column | Helps range-based optimizations (for example, WHERE x BETWEEN 10 AND 20) | Decisions on JOIN order and algorithms, and resource allocation |
Column size | Total size of column data in memory | Estimates the overall size of scans (reading data), JOINs, and query results | Decisions on JOIN algorithms and ordering |
Open file formats such as Apache Parquet carry none of the preceding statistics by default, and table formats like Apache Iceberg carry only a subset of them, such as number of rows, NULL count, and min/max values. This can make it challenging for query engines to plan efficient queries. Amazon Redshift has added innovations that improve overall query performance on data lake data stored in Apache Iceberg and Apache Parquet formats even when all or some table- or column-level statistics are unavailable. The next section reviews features in Amazon Redshift that help improve query performance on data lakes even when table statistics aren't present or are limited.
Amazon Redshift features for when data lakes don't have statistics for Iceberg tables and Parquet
As mentioned previously, there are many cases where tables stored in data lakes lack statistics, which makes it challenging for query engines to make informed decisions when selecting the best query plan. However, Amazon Redshift has introduced a series of innovations that improve performance for queries on data lakes even when there aren't table statistics available. In this section, we review some of these improvements and how they impact your query performance.
Dynamic partition elimination through distributed joins
Dynamic partition elimination is a query optimization technique that allows Amazon Redshift to skip reading unnecessary data during query execution on a partitioned table. It does this by identifying which partitions of a table are relevant to a query and scanning only those partitions, significantly reducing the amount of data that needs to be processed.
For example, consider a schema that has two tables:

- `sales` (fact table) with columns: `sale_id`, `product_id`, `sale_amount`, `sale_date`
- `items` (dimension table) with columns: `product_id`, `product_name`, `category`

The `sales` table is partitioned by `product_id`. In the following example, you want to find the total sales amount for products in the Electronics category in December 2024.
SQL query:
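A sketch of such a query over the schema above:

```sql
-- Total Electronics sales for December 2024
SELECT SUM(s.sale_amount) AS total_sales
FROM sales s
JOIN items i ON i.product_id = s.product_id
WHERE i.category = 'Electronics'
  AND s.sale_date BETWEEN '2024-12-01' AND '2024-12-31';
```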
How Amazon Redshift improves this query:
- Filter on the dimension table: the query filters the `items` table to only include products in the `Electronics` category.
- Identify relevant partitions: with the new enhancements, Amazon Redshift analyzes this filter and determines which partitions of the `sales` table need to be scanned. It looks at the `product_id` values in the `items` table that match the `Electronics` category and scans only those specific partitions of the `sales` table.
- Instead of scanning the entire `sales` table, Amazon Redshift scans only the partitions that contain sales data for Electronics products.
- This significantly reduces the amount of data Amazon Redshift needs to process, making the query faster.
Previously, this optimization was only applied on broadcast joins when all child joins beneath the join were also broadcast joins. The Amazon Redshift team extended this capability to work on all broadcast joins, regardless of whether the child joins beneath them are broadcast. This allows more queries to benefit from dynamic partition elimination, such as TPC-DS Q64 and Q75 for Iceberg tables, and TPC-DS Q25 in Parquet.
Metadata caching for Iceberg tables
The Iceberg open table format employs a two-layer structure: a metadata layer and a data layer. The metadata layer has three levels of files (`metadata.json`, manifest lists, and manifests), which enables performance features such as faster scan planning and advanced data filtering. Amazon Redshift uses the Iceberg metadata structure to efficiently identify the relevant data files to scan, using partition value ranges and column-level statistics and eliminating unnecessary data processing.
The Amazon Redshift team observed that Iceberg metadata is frequently fetched multiple times both within and across queries, leading to potential performance bottlenecks. We implemented an in-memory LRU (least recently used) cache for parsed metadata, manifest list files, and manifest files. This cache keeps the most recently used metadata in memory so that we avoid fetching it repeatedly from Amazon Simple Storage Service (Amazon S3) across queries. This caching contributed overall performance improvements of up to 2% in a TPC-DS 3TB workload. We observe more than 90% cache hits for these metadata structures, reducing Iceberg metadata processing times considerably.
Stats inference for Iceberg tables
As mentioned previously, the Apache Iceberg table format comes with some statistics, such as number of rows, number of NULLs, column min/max values, and column storage size, in metadata files called manifest files. However, these don't always provide all the statistics that we need, especially average column width, which is important for the cost-based optimizer used by Amazon Redshift.
We delivered a feature to estimate the average width of variable-length columns, such as string and binary, from Iceberg metadata. We do this by using the column storage size and the number of rows, adjusting for column compression when necessary. By inferring these additional statistics, our optimizer can make more accurate cost estimates for different query plans. This stats inference feature, launched in Amazon Redshift patch 186, delivers up to a 7% improvement in the TPC-DS benchmarks. We have also enhanced the Amazon Redshift optimizer's cost model. The enhancements include planner optimizations that improve the estimation of the different join distribution strategies to take into account the networking cost of distributing data between the nodes of an Amazon Redshift cluster, as well as further improvements to the Amazon Redshift query optimizer. These enhancements, which are the culmination of several years of research, testing, and implementation, demonstrated up to a 45% improvement in a set of TPC-DS benchmarks.
Example: TPC-DS benchmark highlights for Amazon Redshift no-stats queries on data lakes
One way to measure data lake query performance for Amazon Redshift is the TPC-DS benchmark. The TPC-DS benchmark is a standardized benchmark designed to test decision support systems, specifically looking at concurrently accessed systems where queries can range from shorter analytical queries (for example, reporting and dashboards) to longer-running ETL-style queries for moving and transforming data into a different system. For these tests, we used the Cloud Data Warehouse Benchmark derived from TPC-DS 3TB to align our testing with many common analytical workloads, and to provide a standard set of comparisons for measuring improvements to Amazon Redshift data lake query performance.
We ran these tests across data stored in the Apache Parquet data format, as well as Apache Iceberg tables with data in Apache Parquet files. Because we focused these tests on out-of-the-box performance, none of these datasets had any table statistics available. We performed these tests using the Amazon Redshift patch versions specified in the following table, and used Amazon Redshift Serverless with 88 RPU without any additional tuning. The following results represent a power run, which is the sum of how long it took to run all the queries, from a warm run, which is the result of the power run after at least one execution of the workload:
Dataset | P180 (12/2023) power run (seconds) | P190 (5/2025) power run (seconds) |
Apache Parquet (only numrows) | 7,796 | 3,553 |
Apache Iceberg (out-of-the-box, no tuning) | 4,411 | 1,937 |
We observed notable improvements in several query run times. For this post, we focus on the improvements we observed in query 82:
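The benchmark query text is not reproduced here; based on the description below (top 100 selling brands for a specific manager in December 2002), it has the shape of the following sketch. Table and column names follow the TPC-DS schema; the manager ID is illustrative:

```sql
-- Sketch of a top-100-brands query in the shape described below;
-- the i_manager_id value is illustrative
SELECT i.i_brand_id,
       i.i_brand,
       SUM(ss.ss_ext_sales_price) AS ext_price
FROM date_dim d
JOIN store_sales ss ON ss.ss_sold_date_sk = d.d_date_sk
JOIN item i ON i.i_item_sk = ss.ss_item_sk
WHERE i.i_manager_id = 28
  AND d.d_moy = 12
  AND d.d_year = 2002
GROUP BY i.i_brand_id, i.i_brand
ORDER BY ext_price DESC, i.i_brand_id
LIMIT 100;
```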
In this query, we're searching for the top 100 selling brands from a specific manager in December 2002, which represents a typical dashboard-style analytical query. In our power run, we observed a reduction in query time from 512 seconds to 18.1 seconds for Apache Parquet data, a 28.2x improvement in performance. The accelerated performance for this query in a warm run is due to the improvements to the cost-based optimizer and dynamic partition elimination.
We saw query performance improvements across many of the queries found in the Cloud Data Warehouse Benchmark derived from the TPC-DS test suite. We encourage you to try your own performance tests using Amazon Redshift Serverless on your data lake data to see what performance gains you can observe.
Cleanup
If you ran these tests on your own and don't need the resources anymore, you'll need to delete your Amazon Redshift Serverless workgroup. See Shutting down and deleting a cluster. If you no longer need to store the Cloud Data Warehouse Benchmark data in your S3 bucket, see Deleting Amazon S3 objects.
Conclusion
In this post, you learned how cost-based optimizers for databases work, and how statistical information about your data helps Amazon Redshift execute queries more efficiently. You can optimize query performance for Iceberg tables by automatically collecting Puffin statistics, which lets Amazon Redshift use these recent innovations to query your data more efficiently. Giving more information to your query planner, the brain of Amazon Redshift, helps provide more predictable performance and lets you further scale how you interact with your data in your data lakes and data lakehouses.
About the authors
Martin Milenkoski is a Software Development Engineer on the Amazon Redshift team, currently focusing on data lake performance and query optimization. Martin holds an MSc in Computer Science from the École Polytechnique Fédérale de Lausanne.
Kalaiselvi Kamaraj is a Sr. Software Development Engineer on the Amazon Redshift team. She has worked on several projects within the Amazon Redshift query processing team and is currently focusing on performance-related projects for the Amazon Redshift data lake and query optimizer.
Jonathan Katz is a Principal Product Manager – Technical on the AWS Analytics team and is based in New York. He is a Core Team member of the open-source PostgreSQL project and an active open-source contributor, including to the pgvector project.