As Amazon Web Services (AWS) expanded its offerings to include massive-scale data processing through Amazon Elastic MapReduce (EMR), the Financial Industry Regulatory Authority (FINRA) sought a robust monitoring solution to keep operations running smoothly. The challenge was scaling real-time observability to EMR's very large workloads on Amazon EC2, which called for a collaborative effort between FINRA and AWS.

FINRA handles massive amounts of data, processing large volumes of diverse information and complex workloads across numerous instances. Amazon EMR is a cloud-based big data platform that processes massive amounts of data using popular open-source tools such as Hadoop, Spark, HBase, Flink, Hudi, and Presto, simplifying the work of extracting insights from large datasets.

Monitoring EMR clusters is crucial for detecting critical events and issues affecting applications, infrastructure, or data in real time. A well-designed monitoring system helps teams identify root causes quickly, automate bug resolution, minimize manual intervention, and boost productivity. Monitoring cluster performance and utilization over time also helps operations and engineering teams identify potential bottlenecks and optimize scaling, which further reduces manual intervention and improves compliance with service level agreements.

This post discusses the challenges our team faced and how we built an observability framework that provides operational metrics insights for big data processing workloads running on Amazon EMR clusters deployed on Amazon EC2 instances.

Problem

In today’s data-driven era, organizations strive to derive valuable insights from vast amounts of information. Our challenge was finding a sustainable way to track and monitor massive data processing workloads on Amazon EMR, given the complexity of the system. Monitoring and observability for Amazon EMR pose several challenges:

EMR clusters process vast amounts of data across multiple nodes. Monitoring such complex, distributed systems means handling a large influx of monitoring data, and that volume makes it hard to identify and troubleshoot issues in real time.
EMR clusters are often ephemeral, created and terminated in response to fluctuating workload demands. This dynamism makes it difficult to monitor continuously, collect reliable metrics, and preserve observability over time.
Monitoring overall cluster health and gaining insight into its workings is necessary to identify bottlenecks, spot unusual patterns in data processing, resolve data discrepancies, optimize job performance, and maintain overall efficiency. A comprehensive understanding requires diving deep into large-scale clusters, nodes, and tasks while also accounting for data skew, stuck tasks, and performance metrics such as those provided by Spark and the Java Virtual Machine (JVM). Achieving observability across these diverse data types proved to be a significant challenge.
EMR clusters comprise a diverse array of components and services operating in tandem, which makes it hard to oversee the entire system. Monitoring the allocation of resources such as CPU, memory, and disk I/O across multiple nodes is crucial in complex distributed environments to prevent bottlenecks and optimize performance.
Capturing and analyzing latency and performance metrics in real time is crucial for prompt issue resolution, but the inherent complexity of Amazon EMR’s distributed architecture makes this difficult.
Providing a unified view of an EMR cluster’s performance was a challenge: a single comprehensive snapshot covering health, resource utilization, job execution, logs, and security.
Centralized alerting and notification systems proved challenging to set up efficiently. Alerts for critical events or performance metrics need careful planning to balance timely notifications against unnecessary interruptions. Incidents arising from performance slowdowns or disruptions demand a prompt response and a substantial investment of time and resources to identify and fix the root causes.
Finally, balancing cost optimization with continuous, effective monitoring remains a persistent challenge. Weighing comprehensive monitoring against cost constraints requires deliberate planning and optimized strategies to avoid unnecessary expense while still maintaining sufficient monitoring coverage.

Effective observability for Amazon EMR requires a combination of the right tools, methodologies, and best practices to address these complexities and deliver reliable, efficient, and cost-effective big data processing.

On Amazon EMR, Ganglia monitors the health of the entire cluster and of each node, tracking metrics such as those from Hadoop, Spark, and the JVM. The Ganglia web UI, opened in a browser, presents an overview of the EMR cluster’s performance with graphs that detail load, memory utilization, CPU usage, and network traffic. However, with Ganglia’s deprecation announced by its maintainers, it became important for FINRA to build this solution.

Solution overview

Insights from a previously published post significantly informed our approach. That post demonstrated how to set up a monitoring system that uses Kubernetes and Prometheus to monitor an EMR cluster, and how to use Grafana dashboards to visualize metrics and identify areas for optimization.

Based on these insights, we completed a successful proof of concept. We then built an enterprise-wide monitoring solution with Amazon Managed Service for Prometheus (Managed Prometheus) and Amazon Managed Grafana (Managed Grafana) to replicate Ganglia-like metrics at FINRA, improving the organization’s ability to visualize and analyze key performance indicators. Managed Prometheus collects high volumes of data in real time and adapts to workload fluctuations by scaling the ingestion, storage, and querying of operational metrics. The metrics are then fed into the Managed Grafana workspace, where they are turned into visualizations.

Our solution includes a data ingestion layer for each cluster, with metric collection customized through a script stored in Amazon S3. We also install Prometheus automatically at EC2 instance startup through a bootstrap script on the Amazon EMR instances. In addition, application-specific tags are defined in the configuration file to optimize collection and capture the desired metrics.

After Prometheus on an EMR cluster collects the metrics, it sends them to a remote Managed Prometheus workspace. Managed Prometheus workspaces are logical, remote environments dedicated to hosting specific metrics from the associated Prometheus servers. They also provide access control for authorizing who or what sends and receives metrics within that workspace. You can create additional workspaces by account or application as needed, which makes management easier.
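The ingestion path described above can be sketched as a small Python helper that a bootstrap script might use to generate a Prometheus configuration. This is a minimal sketch under stated assumptions, not FINRA's actual script: the function name, label names, scrape target, cluster ID, and workspace ID are all hypothetical placeholders. The `remote_write` block with `sigv4` signing and the workspace URL pattern follow the documented way of sending metrics to an Amazon Managed Service for Prometheus workspace.

```python
import json


def build_prometheus_config(cluster_id, app_tag, workspace_id, region, targets):
    """Return a Prometheus configuration as a plain dict.

    All argument values are hypothetical placeholders; the scrape targets
    and label names are assumptions, not settings taken from the post.
    """
    remote_write_url = (
        f"https://aps-workspaces.{region}.amazonaws.com"
        f"/workspaces/{workspace_id}/api/v1/remote_write"
    )
    return {
        "global": {
            "scrape_interval": "30s",
            # External labels are attached to every sample sent upstream,
            # so dashboards can filter by cluster or application.
            "external_labels": {"cluster_id": cluster_id, "application": app_tag},
        },
        "scrape_configs": [
            {"job_name": "emr_node", "static_configs": [{"targets": list(targets)}]}
        ],
        "remote_write": [
            # SigV4 signing lets the workspace authorize the sender via IAM.
            {"url": remote_write_url, "sigv4": {"region": region}}
        ],
    }


config = build_prometheus_config(
    cluster_id="j-EXAMPLE",
    app_tag="example-app",
    workspace_id="ws-EXAMPLE",
    region="us-east-1",
    targets=["localhost:9100"],
)
# JSON is valid YAML, so this output could be written out as prometheus.yml.
print(json.dumps(config, indent=2))
```

Tagging every sample with cluster and application labels at the source is one way to keep metrics from short-lived clusters distinguishable after the cluster itself is gone.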

After the metrics are collected, we built a mechanism to render them on Managed Grafana dashboards, which are exposed through endpoints for further analysis and decision-making. We designed custom dashboards for task-level, node-level, and cluster-level metrics so that they can be promoted from lower environments to higher ones. We also built a set of templated dashboards that show node-level metrics such as CPU usage, memory availability, network activity, disk I/O, HDFS statistics, YARN metrics, and Spark metrics, along with job-level metrics for both Spark and JVM processes. Automated metric aggregation in each account gives users a comprehensive view of their environments.
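To make the idea of templated node-level panels concrete, the following sketch builds PromQL expressions parameterized by cluster. It assumes the nodes expose standard Prometheus node exporter metrics, which the post does not confirm, and the `cluster_id` label and function name are hypothetical.

```python
def node_dashboard_queries(cluster_id):
    """Build PromQL expressions for a few node-level dashboard panels.

    Assumes standard node exporter metric names; the `cluster_id` label
    is a hypothetical label attached to the metrics at collection time.
    """
    sel = f'cluster_id="{cluster_id}"'
    return {
        # Percent of CPU time not spent idle, averaged per instance.
        "cpu_used_percent": (
            "100 - avg by (instance) "
            f'(rate(node_cpu_seconds_total{{mode="idle",{sel}}}[5m])) * 100'
        ),
        # Memory currently available to applications, in bytes.
        "memory_available_bytes": f"node_memory_MemAvailable_bytes{{{sel}}}",
        # Inbound network throughput in bytes per second.
        "network_receive_bytes_per_s": (
            f"rate(node_network_receive_bytes_total{{{sel}}}[5m])"
        ),
    }


queries = node_dashboard_queries("j-EXAMPLE")
for panel, expr in queries.items():
    print(f"{panel}: {expr}")
```

Because only the label selector changes between clusters, the same panel definitions can be reused across accounts and environments, which is the property that makes dashboard promotion straightforward.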

We chose a SAML-based authentication approach, which let us integrate with existing Active Directory (AD) groups and minimized the effort required to onboard users and grant access to user-specific Grafana dashboards. We set up three groups (admins, editors, and viewers) for user authentication in Grafana, with roles assigned according to each user’s profile.
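The three-group model can be illustrated with a short sketch that resolves a user's directory groups to a single Grafana role. The group names below are hypothetical, and in practice the mapping is configured in the identity provider and the Grafana workspace rather than in application code; the sketch only shows the precedence logic.

```python
# Hypothetical Active Directory group names; real names would differ.
GROUP_TO_ROLE = {
    "grafana-admins": "ADMIN",
    "grafana-editors": "EDITOR",
    "grafana-viewers": "VIEWER",
}

# Higher rank wins when a user belongs to more than one group.
ROLE_RANK = {"VIEWER": 0, "EDITOR": 1, "ADMIN": 2}


def resolve_role(user_groups):
    """Return the most privileged Grafana role for the given groups.

    Returns None when the user belongs to no mapped group, meaning the
    user should not be granted access at all.
    """
    roles = [GROUP_TO_ROLE[g] for g in user_groups if g in GROUP_TO_ROLE]
    if not roles:
        return None
    return max(roles, key=ROLE_RANK.__getitem__)


print(resolve_role(["grafana-viewers", "grafana-editors"]))
print(resolve_role(["unrelated-group"]))
```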

Through monitoring and alerting automation, key metrics are transmitted automatically. We use Amazon CloudWatch to trigger the necessary alerts whenever a metric exceeds its predefined threshold.
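A threshold alert of this kind might be defined as follows. This is a sketch only: the namespace, metric name, dimension, threshold, and SNS topic ARN are hypothetical, and the post does not say how notifications are delivered. The dictionary keys match the parameters of the CloudWatch `PutMetricAlarm` API, so the request could be passed to boto3's `put_metric_alarm`; the call itself is left as a comment so the example runs without AWS credentials.

```python
def build_alarm_request(cluster_id, metric_name, threshold, sns_topic_arn):
    """Build keyword arguments for CloudWatch's PutMetricAlarm call.

    The namespace and dimension name are hypothetical placeholders.
    """
    return {
        "AlarmName": f"{cluster_id}-{metric_name}-high",
        "Namespace": "Custom/EMRObservability",  # hypothetical namespace
        "MetricName": metric_name,
        "Dimensions": [{"Name": "ClusterId", "Value": cluster_id}],
        "Statistic": "Average",
        "Period": 300,  # evaluate 5-minute averages
        "EvaluationPeriods": 3,  # require 3 consecutive breaches
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


request = build_alarm_request(
    cluster_id="j-EXAMPLE",
    metric_name="cpu_used_percent",
    threshold=85,
    sns_topic_arn="arn:aws:sns:us-east-1:123456789012:example-topic",
)
print(request["AlarmName"])

# With AWS credentials configured, the request could be submitted as:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**request)
```

Requiring several consecutive breaches before alarming is one common way to balance timely notification against noise from short spikes.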

The following diagram illustrates the solution architecture.

Sample dashboards

The following screenshots show example dashboards.

Conclusion

With this data-driven approach, FINRA achieved complete observability of its EMR workloads, which enables optimized performance, improved reliability, and valuable insights into big data operations, ultimately leading to operational excellence.

FINRA’s solution lets operations and engineering teams use a unified dashboard to monitor big data workloads and quickly identify potential operational issues. The scalable solution significantly reduced time to resolution and improved our overall operational posture. It gives operations and engineering teams comprehensive visibility into a range of Amazon EMR metrics, including operating system-level metrics, Apache Spark applications, Java Management Extensions (JMX), Hadoop Distributed File System (HDFS) metrics, and YARN data, all accessible from a single platform. We also extended the solution to cover Amazon EKS use cases, including EMR on EKS clusters and other applications, establishing it as a comprehensive platform for monitoring metrics across our infrastructure and applications.

About the Authors

Serves as a Senior Director of Technology at FINRA. She oversees large-scale data operations, managing petabyte-scale datasets and complex workloads processed in the cloud. She is also an expert in developing Enterprise Application Monitoring and Observability Solutions, Operational Data Analytics, and Machine Learning Model Governance workflows. Outside of work, she enjoys practicing yoga, singing, and teaching.

Serves as a lead engineering consultant at FINRA, developing robust and scalable solutions. He focuses on improving infrastructure reliability by building tailored monitoring solutions and on maximizing system efficiency through careful optimization. In his free time, he enjoys spending quality time at home and exploring new approaches to learning.

Serves as Director of Market Regulation Technology at FINRA. An expert in large-scale data processing, he designs solutions that integrate workload optimization, data management, and compute. Outside of work, Akhil enjoys sim racing and Formula 1.
