Achieve Optimal Price-Performance on Amazon Redshift through Adaptive Elastic Histograms for Enhanced Selectivity Estimation

October 29, 2024

Amazon Redshift is a cloud-based data warehouse that provides scalability, manageability, and the ability to run complex SQL analytics on both structured and semi-structured data. It also lets you access your data in operational databases, data lakes, or external datasets with minimal manual movement or duplication of data. Thousands of businesses use Amazon Redshift to process enormous data volumes, modernize their analytics workloads, and deliver actionable insights to their clients.

Amazon Redshift remains a leader in the data warehouse market, offering exceptional price-performance. The advanced query optimizer in Amazon Redshift plays a pivotal role in that performance. The query optimizer is responsible for choosing the most efficient plan to answer a query. It combines statistics about the data with the query itself to compute a cost estimate for many candidate plans.

Amazon Redshift has built-in autonomics that gather statistics automatically, a capability known as automatic analyze (or auto analyze). Auto analyze runs continuously in the background on Redshift tables to keep statistics accurate and up to date. Collecting statistics is not excessively expensive in itself, but the cost escalates quickly when data arrives at a high rate, which makes keeping statistics current in real time a significant challenge. As data accumulates in the Redshift data warehouse, statistics can become stale, leading to imprecise selectivity estimates and, in turn, suboptimal query plans that hurt query performance.

Challenges with stale statistics

Our analysis of customer workloads on Amazon Redshift showed that stale statistics are especially harmful when estimating predicate selectivity on temporal columns such as DATE and TIMESTAMP. Approximately 11% of predicate columns in queries across the Amazon Redshift fleet are DATE or TIMESTAMP columns, more than 40% of query scans in the fleet have predicates on DATE or TIMESTAMP columns, and customer workloads tend to query recent (hot) data more often than historical (cold) data. A typical customer query, modeled on a common analytics benchmark, looks like this:

SELECT ...
FROM   lineitem
       JOIN orders ON l_orderkey = o_orderkey
       JOIN customer ON ...
WHERE  l_shipdate >= current_date - $1
  AND  ...

Figure 1: Amazon Redshift fleet metrics on temporal vs non-temporal data types

Solution overview

Amazon Redshift introduced a new selectivity estimation technique in version 1.0.75379 to address the challenge of keeping statistics accurate for temporal columns and to produce better query plans as a result. The technique captures real-time statistical metadata during data ingestion without adding computational overhead. When a query contains temporal predicates, the query optimizer uses this real-time metadata to refine the existing statistics by dynamically adjusting histogram boundaries, which improves selectivity estimates for those predicates. Figures 2 and 3 show the performance improvement that elastic histograms for selectivity estimation deliver. The optimization is enabled automatically, with no customer configuration or manual intervention required, so users benefit from it immediately.
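The idea of adjusting histogram boundaries with ingestion-time metadata can be illustrated with a small sketch. This is not Amazon Redshift's implementation, which the post does not describe in detail. The Histogram class, the selectivity_ge and stretch functions, and the sample dates and row counts are all hypothetical. The sketch also assumes that every row ingested since the last analyze falls between the old maximum and the new maximum, and that values are spread uniformly within a bucket.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Histogram:
    """Bucket i covers bounds[i] .. bounds[i + 1] and holds counts[i] rows."""
    bounds: List[date]
    counts: List[int]


def selectivity_ge(hist: Histogram, value: date) -> float:
    """Estimate the fraction of rows whose column value is >= `value`."""
    total = sum(hist.counts)
    if total == 0:
        return 0.0
    matched = 0.0
    for i, count in enumerate(hist.counts):
        lo, hi = hist.bounds[i], hist.bounds[i + 1]
        if value <= lo:
            matched += count  # the whole bucket qualifies
        elif value < hi:
            # Assumption: values are spread uniformly inside the bucket.
            matched += count * (hi - value).days / (hi - lo).days
    return matched / total


def stretch(hist: Histogram, ingest_max: date, ingest_rows: int) -> Histogram:
    """Extend a stale histogram using metadata captured at ingestion time.

    Assumption (hypothetical): all rows added since the statistics were
    collected lie between the old maximum and `ingest_max`.
    """
    known_rows = sum(hist.counts)
    if ingest_max <= hist.bounds[-1] or ingest_rows <= known_rows:
        return hist  # nothing newer than what the histogram already covers
    return Histogram(hist.bounds + [ingest_max],
                     hist.counts + [ingest_rows - known_rows])


# Hypothetical statistics collected when the newest l_shipdate was 2024-07-01.
stale = Histogram([date(2024, 1, 1), date(2024, 4, 1), date(2024, 7, 1)],
                  [900, 900])
# Hypothetical ingestion metadata: newest value and total row count today.
elastic = stretch(stale, ingest_max=date(2024, 10, 1), ingest_rows=2700)

cutoff = date(2024, 9, 1)  # predicate: l_shipdate >= cutoff
print(selectivity_ge(stale, cutoff))
print(selectivity_ge(elastic, cutoff))
```

With this toy data the stale histogram estimates that no rows match a predicate on recent dates, while the stretched histogram produces a nonzero estimate, which is the kind of correction the optimizer needs for queries on hot data.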

Benchmark analysis

We evaluated the new selectivity estimation technique using several variants of a TPC-H query. The query joins lineitem and orders with several other tables and applies multiple predicates, including one on l_shipdate.

When histogram statistics are stale, predicate selectivity estimates become inaccurate, and in this benchmark the selectivity of the predicate on l_shipdate was estimated incorrectly. The poor estimate led to an inefficient query plan with excessive inter-node communication and data redistribution within the Amazon Redshift provisioned cluster or serverless workgroup. With the new selectivity estimation approach, the estimates became significantly more accurate and the optimizer chose a better query plan with a more efficient join order, which substantially improved performance, as shown in Figure 2.
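To see why a bad selectivity estimate can change the plan, consider a deliberately simplified sketch. It is generic cost-based reasoning and not the Amazon Redshift optimizer: it assumes a planner that builds the hash table of a hash join on the input it believes is smaller. The function names, row counts, and selectivity values are all hypothetical.

```python
from typing import Tuple

# Hypothetical table sizes, chosen only for illustration.
LINEITEM_ROWS = 6_000_000
ORDERS_ROWS = 1_500_000


def estimated_rows(table_rows: int, selectivity: float) -> int:
    """Rows the planner expects to survive a filter with this selectivity."""
    return max(1, round(table_rows * selectivity))


def pick_build_side(left: Tuple[str, int], right: Tuple[str, int]) -> str:
    """Return the name of the input with the smaller estimated row count.

    Assumption: the planner builds the hash table on the smaller input.
    """
    return min(left, right, key=lambda side: side[1])[0]


# Stale statistics: the filter on l_shipdate looks extremely selective.
stale_rows = estimated_rows(LINEITEM_ROWS, 0.0001)
# Accurate statistics: a large share of lineitem actually qualifies.
fresh_rows = estimated_rows(LINEITEM_ROWS, 0.40)

stale_choice = pick_build_side(("lineitem", stale_rows), ("orders", ORDERS_ROWS))
fresh_choice = pick_build_side(("lineitem", fresh_rows), ("orders", ORDERS_ROWS))
print(stale_choice, fresh_choice)
```

Under the stale estimate the planner treats the filtered lineitem input as tiny and builds on it, even though it is actually the larger input. The same kind of misjudgment, compounded across several joins, is what produces a poor join order and unnecessary data movement.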

Figure 2: Relative performance of TPC-H query variant (lower is better)

Figure 3: Query Plan comparison: Before enhancement (left), After enhancement (right)

Conclusion

We have introduced new performance optimizations in Amazon Redshift query processing: elastic histogram statistics improve selectivity estimation and overall query plan quality for Amazon Redshift queries when recent table statistics are not available.

Amazon Redshift now improves query performance by using metadata statistics collected during ingestion for optimizations such as elastic histograms for selectivity estimation when recent statistics are not available. This enhancement is enabled by default, giving Amazon Redshift customers faster query response times for their workloads. Amazon Redshift is committed to continually improving its price-performance. Since its introduction in patch release P183, the enhanced selectivity estimation has improved the performance of thousands of customer queries across the Amazon Redshift fleet.

We’re pleased to highlight another instance of our ongoing efforts to continually improve and refine Redshift, ensuring it remains the industry leader in terms of price-performance.

We’re excited to keep introducing new features and performance improvements in Amazon Redshift so that you can get even more value from your data analytics. To learn more and try these improvements yourself, contact your AWS account team to schedule a complimentary session or demo of Amazon Redshift. They can also help you identify the analytics solution that best fits your organization’s requirements.

About the authors

A software development engineer on the Amazon Redshift team who specializes in query optimization and query performance. He earned a Bachelor of Arts degree in Computer Science and Mathematics from Cornell College.

Mohammed is an engineering manager at Amazon Redshift. Before joining Amazon, he accumulated 12 years of industry experience in query optimization and database internals, first as an individual contributor and later as an engineering manager. Mohammed holds 18 US patents and has published in premier database conferences, including EDBT, ICDE, SIGMOD, and VLDB. He earned his PhD in Computer Science from the University of Vermont, along with Master’s and Bachelor’s degrees in Data Sciences from Cairo University.

Mengchu is a senior technical leader on the Amazon Redshift team. He currently focuses on query optimization and on data lake query performance. He also led the development of SQL language variants. Mengchu earned his PhD in Computer Science and Engineering from the University of Nebraska-Lincoln.

Ravi is a senior leader on the Amazon Redshift team who oversees several key areas, including spatial analytics, streaming analytics, query optimization, Spark integration, and analytics business strategy, drawing on expertise in AI, data management, and data science. He has experience with relational databases, multi-dimensional databases, Internet of Things (IoT) technologies, and storage and compute infrastructure services, and he has also founded startups focused on artificial intelligence (AI) and deep learning. Ravi holds dual Bachelor’s degrees in Physics and Electrical Engineering from Washington University in St. Louis, a Master of Science in Engineering from Stanford University, and a Master of Business Administration from the University of Chicago Booth School of Business.
