Wednesday, April 2, 2025

As a leading online real estate marketplace, REA Group uses Amazon Managed Streaming for Apache Kafka (Amazon MSK) to power its scalable and highly available streaming infrastructure. In this post, REA shares the best practices it follows for MSK cluster capacity planning.

Enterprises that need to share substantial volumes of data seamlessly across multiple domains and services must build a scalable cloud infrastructure that adapts to evolving demands. REA Group, a leading digital real estate business, met this challenge with Amazon Managed Streaming for Apache Kafka (Amazon MSK) and its Hydro data streaming platform.

With a team of over 3,000 people, REA Group is driven by its mission to revolutionize the way the world interacts with property. We empower individuals to navigate every aspect of their real estate journey – from purchasing, selling, and renting, to comprehensive property expertise – through unparalleled content, intelligence, and expert guidance, backed by precise valuation estimates and streamlined financing solutions. We deliver unmatched value to Australian real estate agents by providing access to their largest and most targeted audience of property seekers.

To operate as one integrated organization, distinct technical products typically require efficient and reliable data transfer across domains and services.

Within the Data Platform team, we developed a real-time data streaming platform called Hydro to provide this capability across the entire organization. Hydro is built on Amazon Managed Streaming for Apache Kafka (Amazon MSK) and integrates with a range of tools that let teams produce, process, and share data at low latency using an event-driven architecture. This style of architecture is fundamental to REA's approach to building microservices and enables timely data processing in both real-time and batch settings, such as time-sensitive outbound messaging, personalization, and machine learning applications.

This post details the methodology we follow for MSK cluster capacity planning in Hydro.

The challenge

Hydro supports large-scale Amazon MSK deployments by providing configuration abstractions that let users focus on generating value for REA without the cognitive burden of infrastructure management. As demand for Hydro grows within REA, thorough capacity planning is essential to meet user demands while remaining efficient and cost-effective.

Hydro uses provisioned MSK clusters in both its development and production environments. In each environment, Hydro manages a single MSK cluster that hosts multiple tenants with distinct workload requirements. Proper capacity planning ensures the clusters can absorb high traffic and provide a consistent level of service for all users.

Real-time streaming is a relatively recent addition at REA. Because many users are not yet proficient with Apache Kafka, accurately determining their workload requirements is challenging. To keep the Hydro platform reliable and scalable, we need a proactive approach to capacity planning that accounts for the potential impact of user workloads on our clusters.

Objectives

Capacity planning involves determining the optimal configuration and scale of a cluster, taking into account current and anticipated workload demands as well as factors such as data replication, network bandwidth, and storage capacity.

Without effective capacity planning, Hydro clusters risk being overwhelmed by excessive traffic and failing to deliver the expected level of service. We must therefore invest time and resources in capacity planning so that our Hydro clusters can deliver the reliability and scalability modern applications demand.

The capacity planning process we follow in Hydro covers three core areas:

  • The formulas used to calculate current and forecast capacity, and the attributes that serve as variables in those formulas.
  • How to estimate the capacity required by new workloads onboarding to the platform.
  • The tooling available to operators and maintainers for evaluating historical and current capacity utilization on the platform.
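The first of these areas, the capacity formulas, can be sketched in a few lines. The attribute names and limit values below are hypothetical placeholders, not Hydro's actual formulas; real limits come from performance testing and Amazon MSK best practices.

```python
# Hypothetical capacity-limit table (values are illustrative only).
CAPACITY_LIMITS = {
    "bytes_in_per_sec": 80_000_000,  # from load testing (hypothetical)
    "cpu_percent": 60.0,             # MSK guidance: keep below 60%
    "disk_percent": 85.0,            # MSK guidance: keep below 85%
}

def utilization_percent(current: float, limit: float) -> float:
    """Current usage expressed as a percentage of the capacity limit."""
    return 100.0 * current / limit

def headroom(current: float, limit: float) -> float:
    """Remaining capacity before the limit is reached."""
    return limit - current

current = {"bytes_in_per_sec": 20_000_000, "cpu_percent": 30.0, "disk_percent": 42.5}
report = {
    attr: round(utilization_percent(current[attr], limit), 1)
    for attr, limit in CAPACITY_LIMITS.items()
}
print(report)  # {'bytes_in_per_sec': 25.0, 'cpu_percent': 50.0, 'disk_percent': 50.0}
```

Forecast capacity follows the same shape: the variables are projected rather than observed attribute values.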

The following diagram illustrates the relationship between capacity utilization and maximum capacity limits.

Although this capability doesn't exist yet, our goal is to extend the approach with predictive modeling that estimates the approximate time until a resource is exhausted, as illustrated in the following diagram.
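As a sketch of what such predictive modeling could look like, the snippet below fits a least-squares line through recent utilization samples and extrapolates to the capacity limit. This is a simple linear extrapolation with hypothetical numbers, not a production algorithm.

```python
def time_to_exhaustion(samples, limit):
    """Estimate periods remaining until `limit` is reached, using a
    least-squares linear fit over equally spaced usage samples.
    Returns None if usage is flat or shrinking."""
    n = len(samples)
    ts = list(range(n))
    mean_t = sum(ts) / n
    mean_u = sum(samples) / n
    cov = sum((t - mean_t) * (u - mean_u) for t, u in zip(ts, samples))
    var = sum((t - mean_t) ** 2 for t in ts)
    slope = cov / var
    if slope <= 0:
        return None
    return (limit - samples[-1]) / slope

# Disk usage (%) sampled daily; hypothetical numbers.
usage = [40, 42, 44, 46, 48]
print(time_to_exhaustion(usage, 85))  # ~18.5 days at +2%/day
```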

To keep our infrastructure reliable and cost-efficient, we need comprehensive visibility into current utilization. This lets us identify the performance constraints of the existing infrastructure and pinpoint potential bottlenecks before they impact our services and users.

By establishing and tracking clear thresholds, we receive timely notifications and can make the necessary capacity adjustments. This ensures the infrastructure accommodates unexpected surges in demand without sacrificing efficiency, enabling a seamless user experience while protecting the integrity of the system.

Solution overview

The MSK clusters in Hydro's infrastructure are configured with the PER_TOPIC_PER_BROKER monitoring level, which provides metrics at both the broker and topic levels. These metrics guide our assessment of cluster utilization.

While it may be tempting to display an extensive range of metrics on our monitoring dashboards, doing so would reduce readability and slow down insight into the cluster's performance. Rather than showing as many metrics as possible, it is crucial to select the ones most relevant to capacity planning.

Cluster utilization attributes

Based on Amazon MSK best practices, we identified several key attributes for assessing the health and performance of an MSK cluster. These attributes include the following:

  • In/out throughput
  • CPU utilization
  • Disk space utilization
  • Memory utilization
  • Producer and consumer latency
  • Producer and consumer throttling

For additional guidance on right-sizing clusters, refer to the best practices section of the Amazon MSK documentation.

The following table lists the attributes used for MSK cluster capacity planning in Hydro.

Attribute | Category | Unit | Guideline
Bytes in | Throughput | Bytes per second | Depends on the broker instance type and cluster configuration
Bytes out | Throughput | Bytes per second | Depends on the combination of EC2 instance type, EBS volumes, and EBS storage throughput settings
Consumer latency | Latency | Milliseconds | Sustained high latency signals a degraded end-user experience sooner than exhaustion of resources such as CPU and memory
CPU utilization | Capacity limits | Percent (CPU user + CPU system) | Should stay below 60%
Disk space utilization | Storage | Bytes | Should stay below 85%
Memory utilization | Capacity limits | Percent of memory in use | Should stay below 60%
Producer latency | Latency | Milliseconds | Sustained high latency signals a degraded end-user experience sooner than exhaustion of resources such as CPU and memory
Throttling | Capacity limits | Milliseconds, bytes, or messages | Sustained throttling indicates capacity limits are being exceeded sooner than resources such as CPU and memory are depleted

By tracking these attributes, we can quickly assess the remaining capacity of the clusters as we onboard additional workloads. We then map each attribute to the corresponding MSK metrics that are readily available.

Cluster capacity limits

When we began capacity planning, our MSK clusters were not yet receiving enough traffic to give a clear picture of their capacity limits. To establish theoretical limits, we ran performance tests against dedicated test clusters. We used identical MSK cluster configurations in our test and production environments so the results would carry over. By running multiple test scenarios, we gained a thorough understanding of the clusters' performance limits.

Determining test cluster performance limits

Given time and budget constraints, we prioritized test scenarios that would best gauge the cluster's limits. We focused on tests that pushed large volumes of traffic to the cluster and on scenarios with large numbers of partitions.

After each test, we collected the relevant metrics from the test clusters and isolated the maximum values for the key utilization attributes. We then combined the findings to establish suitable limits for each attribute. The following screenshot shows the exported performance metrics of a test cluster.
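The aggregation step above can be sketched as follows. The attribute names and values are hypothetical; the real process works on metrics exported from the MSK test clusters.

```python
# Peak observations per utilization attribute from each test run (hypothetical).
test_runs = [
    {"bytes_in_per_sec": 72_000_000, "producer_latency_ms": 40},
    {"bytes_in_per_sec": 80_000_000, "producer_latency_ms": 55},
    {"bytes_in_per_sec": 76_000_000, "producer_latency_ms": 48},
]

def derive_limits(runs):
    """Take the maximum observed value per attribute across all runs
    as the working capacity limit for that attribute."""
    limits = {}
    for run in runs:
        for attr, value in run.items():
            limits[attr] = max(limits.get(attr, value), value)
    return limits

print(derive_limits(test_runs))
# {'bytes_in_per_sec': 80000000, 'producer_latency_ms': 55}
```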

Capacity monitoring dashboards

As part of operating the platform, we conduct monthly operational reviews to maintain optimal performance. The review examines an automatically generated operational report covering all systems on the platform. It is grounded in service level objectives (SLOs), which are informed by carefully selected service level indicators (SLIs). A retrospective of the monitoring alerts generated during the preceding month also plays a key role. From this review, we identify issues and make adjustments accordingly.

To inform our operational decisions and give a clear view of each cluster's usage, we created a capacity monitoring dashboard, shown in the following screenshot, for each environment. We built the dashboard as infrastructure as code (IaC) using the AWS Cloud Development Kit (AWS CDK). The dashboard is generated automatically and managed as a core component of the platform's infrastructure, alongside the MSK cluster itself.

The MSK cluster's capacity limits are defined in a dedicated file, and the values are automatically imported into the capacity dashboard, where they appear as annotations on the graph widgets. The annotations give a clear visual indication of each limit and of the remaining headroom at current utilization.
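As an illustration, the limits-to-annotations step might look like the sketch below, which emits the horizontal annotation structure that CloudWatch dashboard graph widgets accept. The limit names and values are hypothetical, and in the real platform the limits come from a configuration file rather than an inline dict.

```python
import json

# Capacity limits; defined in a config file on the real platform,
# inlined here with hypothetical values for illustration.
limits = {"cpu_percent": 60.0, "disk_percent": 85.0}

def horizontal_annotations(limits):
    """Build CloudWatch graph-widget horizontal annotations, one per limit."""
    return [
        {"label": f"{name} capacity limit", "value": value}
        for name, value in sorted(limits.items())
    ]

widget_properties = {"annotations": {"horizontal": horizontal_annotations(limits)}}
print(json.dumps(widget_properties, indent=2))
```

With AWS CDK, the same structure would be passed to the graph widget's annotation properties instead of being serialized by hand.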

The capacity limits for throughput, latency, and throttling were determined through performance testing. The limits for the remaining metrics, such as CPU, disk space, and memory, follow Amazon MSK best practices.

During operational reviews, we scrutinize the capacity gauges to determine whether additional capacity should be added to a cluster. By identifying potential capacity issues early, we can mitigate their impact on customer workloads proactively rather than reacting after performance has degraded.

Preemptive CloudWatch alarms

We implemented preemptive CloudWatch alarms to complement the capacity monitoring dashboards. An alarm triggers a notification before a metric exceeds its capacity limit, warning us when the sustained value passes 80% of the limit. With this in place, we can respond quickly instead of waiting for the monthly review cycle.
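Deriving the 80% early-warning thresholds from the capacity limits is mechanical; a minimal sketch with hypothetical limits:

```python
WARNING_FRACTION = 0.8  # alarm before the hard capacity limit is reached

def alarm_thresholds(limits, fraction=WARNING_FRACTION):
    """Compute the CloudWatch alarm threshold for each capacity limit."""
    return {name: limit * fraction for name, limit in limits.items()}

limits = {"cpu_percent": 60.0, "disk_percent": 85.0}
print(alarm_thresholds(limits))  # {'cpu_percent': 48.0, 'disk_percent': 68.0}
```

Because the thresholds are derived rather than hand-set, a change to a capacity limit automatically propagates to the corresponding alarm.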

Benefits of our capacity planning process

By consistently applying our capacity planning process as part of Hydro platform operations, we have a reliable framework for evaluating the gap between each cluster's theoretical performance limits and its actual utilization. Our capacity monitoring dashboards serve as a vital observability tool that we review regularly to monitor performance and troubleshoot issues, helping us quickly determine whether capacity constraints could be the underlying cause of a current problem. We can apply the same capacity planning process and tools proactively or reactively, depending on the situation.

One notable advantage of this approach is that we can determine the theoretical utilization limits of a given cluster configuration without affecting users of the platform, by testing against a separate cluster. Using our AWS CDK-based automation, we can rapidly provision temporary MSK clusters and run capacity tests on these ephemeral environments, which lets us regularly assess how changes to cluster settings affect the observed capacity limits. If the newly calculated limits differ from those previously identified, we update our capacity dashboards and CloudWatch alarms accordingly.

Future evolution

Hydro continues to improve through the addition of new features and capabilities, including options that make it easy for teams to create Kafka client applications. To meet increasing demand, we must stay ahead in capacity planning. Although the approach described so far has been effective, it is not a final milestone, and opportunities for improvement remain.

Multi-cluster architecture

To better manage critical workloads, we are exploring a multi-cluster architecture on Amazon MSK, which would affect our capacity planning process. Our strategy is to profile workloads based on metadata, verify them against capacity metrics, and allocate them to a suitable MSK cluster. With the existing MSK clusters in place, we will explore how this architecture can enhance the overall platform.

Utilization trends

We have enhanced our capacity monitoring dashboards with widgets that surface utilization trends and anomalies. Although the CloudWatch anomaly detection algorithm evaluates only up to two weeks of metric data, we will reassess its value as we onboard additional workloads. Beyond identifying utilization trends, we will explore building an algorithm with predictive capabilities to detect when MSK cluster resources are likely to be depleted.
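For illustration, a very simple deviation check in the spirit of (but much simpler than) CloudWatch anomaly detection, using hypothetical utilization samples and a z-score style test:

```python
def is_anomalous(history, latest, threshold=3.0):
    """Flag `latest` if it deviates from the historical mean by more
    than `threshold` standard deviations."""
    n = len(history)
    mean = sum(history) / n
    var = sum((x - mean) ** 2 for x in history) / n
    std = var ** 0.5
    if std == 0:
        return latest != mean
    return abs(latest - mean) / std > threshold

history = [50, 52, 51, 49, 50, 51]  # hypothetical utilization samples (%)
print(is_anomalous(history, 75))  # True: far above recent values
print(is_anomalous(history, 52))  # False: within normal variation
```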

Conclusion

Up-front capacity planning establishes a solid foundation for future growth and provides a safe, streamlined onboarding process for workloads. To remain effective, our capacity planning process must evolve alongside the platform. We maintain a close collaboration with AWS to develop solutions that fit our business needs and align with the Amazon MSK roadmap. By staying ahead of demand, we can confidently deliver reliable services to our customers.

We encourage all Amazon MSK users to get the most out of their clusters by developing a comprehensive capacity plan. The strategies outlined here can lay the groundwork for operational efficiency and substantial cost savings over time.


About the Authors

A Staff Data Engineer at REA, she has a background in software engineering spanning various sectors and has recently focused on applying her expertise to real estate. She is a passionate advocate for young women looking to transition into the tech industry, drawing inspiration from her own role models in the space.

A Staff Systems Engineer at REA, he has extensive experience designing and implementing large distributed systems. He focuses on leveraging automation, observability, and Site Reliability Engineering best practices to drive high availability and performance in mission-critical systems and applications.

A Technical Account Manager at Amazon Web Services (AWS), he specializes in sustainable computing solutions, with a passion for Linux and open source that helps him support enterprise customers in modernizing and optimizing their cloud infrastructure.
