Amazon EMR Serverless provides a serverless option to run large-scale data processing frameworks such as Apache Spark and Apache Hive without having to manage clusters or servers. EMR Serverless lets you run analytics workloads at any scale, with automatic scaling that adjusts resources in seconds to match changing data volumes and processing demands.
We recently launched worker metrics for EMR Serverless. This feature lets you monitor vCPU, memory, ephemeral storage, and disk I/O allocation and usage metrics at an aggregate worker level for your Spark and Hive jobs.
This post is part of a series on EMR Serverless observability. In this post, we focus on how to use CloudWatch metrics to monitor EMR Serverless jobs in near real time.
CloudWatch metrics for EMR Serverless
For each Spark job, EMR Serverless publishes the following new metrics to CloudWatch for both the driver and the executors. These metrics provide granular insight into job performance, helping you identify bottlenecks and optimize resource allocation.
| Metric | Description |
| --- | --- |
| WorkerCpuAllocated | Total vCPU cores allocated for workers in a job run. |
| WorkerCpuUsed | Total vCPU cores used by workers in a job run. |
| WorkerMemoryAllocated | Total memory (GB) allocated for workers in a job run. |
| WorkerMemoryUsed | Total memory (GB) used by workers in a job run. |
| WorkerEphemeralStorageAllocated | Bytes of ephemeral storage allocated for workers in a job run. |
| WorkerEphemeralStorageUsed | Bytes of ephemeral storage used by workers in a job run. |
| WorkerStorageReadBytes | Bytes read by workers from storage during a job run. |
| WorkerStorageWriteBytes | Bytes written by workers to storage during a job run. |
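You can also query these metrics outside the dashboard. The following is a minimal sketch of retrieving one of them with the AWS CLI; the CloudWatch namespace and the dimension names and values shown here are assumptions for illustration, so confirm them against the EMR Serverless metrics documentation.

```
# Hypothetical sketch: pull the WorkerCpuUsed metric for one job run.
# The namespace and dimension names below are assumptions -- verify them
# against the EMR Serverless CloudWatch metrics documentation.
aws cloudwatch get-metric-statistics \
    --namespace AWS/EMRServerless \
    --metric-name WorkerCpuUsed \
    --dimensions Name=ApplicationId,Value=<your-application-id> Name=JobId,Value=<your-job-run-id> \
    --statistics Maximum \
    --period 60 \
    --start-time 2024-01-01T00:00:00Z \
    --end-time 2024-01-01T01:00:00Z
```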
The key benefits of monitoring your EMR Serverless jobs with CloudWatch include the following:
- Get insights into resource utilization patterns so you can fine-tune your EMR Serverless configurations for better efficiency and cost savings. For example, identifying underutilized vCPUs and memory can reveal opportunities to right-size workers and reduce cost.
- Identify and proactively address the root causes of common errors without having to dig through logs. For example, you can monitor ephemeral storage usage and allocate additional storage to workers before jobs run out of disk space.
- CloudWatch provides near-real-time monitoring, so you can track the performance of your EMR Serverless jobs as they run and quickly spot inefficiencies or anomalies.
- CloudWatch enables you to set up alarms with Amazon Simple Notification Service (Amazon SNS), so you receive notifications by email or text message when specific metrics reach critical thresholds, as shown in the example after this list.
- CloudWatch stores historical data, allowing you to analyze trends over time, identify patterns, and inform intelligent decisions for capacity planning and workload optimization.
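For example, here is a minimal sketch of an alarm that notifies an SNS topic when aggregated worker ephemeral storage usage crosses a threshold. The namespace, dimension name, threshold value, and topic ARN are assumptions; adjust them for your own application and jobs.

```
# Hypothetical sketch: alarm on high aggregated ephemeral storage usage.
# Namespace, dimensions, and threshold are assumptions -- adjust to your setup.
aws cloudwatch put-metric-alarm \
    --alarm-name emr-serverless-ephemeral-storage-high \
    --namespace AWS/EMRServerless \
    --metric-name WorkerEphemeralStorageUsed \
    --dimensions Name=ApplicationId,Value=<your-application-id> \
    --statistic Maximum \
    --period 300 \
    --evaluation-periods 1 \
    --threshold 1700000000000 \
    --comparison-operator GreaterThanThreshold \
    --alarm-actions arn:aws:sns:<region>:<account-id>:<your-topic-name>
```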
Solution overview
To make these metrics easier to consume, we built a solution that consolidates them onto a single CloudWatch dashboard for an EMR Serverless application. You need to launch one instance of the template for each EMR Serverless application you want to monitor. The same CloudWatch dashboard tracks all jobs submitted to that EMR Serverless application. To learn more about the dashboard and deploy the solution in your own account, refer to the documentation.
We walk you through how to use this dashboard to do the following:
- Optimize resource allocation to minimize waste without compromising performance.
- Diagnose and resolve common job failures without exhaustive log analysis.
Prerequisites
To run the sample jobs in this post, first create an EMR Serverless application with default settings using the AWS CLI or an AWS SDK, and then deploy the CloudFormation template, providing the EMR Serverless application ID as the template's input parameter.
Submit all the sample jobs in this post to that same EMR Serverless application. If you want to monitor a different application, deploy the template against that application's ID instead.
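As a rough sketch of these two prerequisite steps, the following commands create an application with default settings and then deploy the dashboard template against it. The application name, release label, template file name, and parameter key are assumptions; use the names that come with the template you downloaded for this post.

```
# Hypothetical sketch: create an EMR Serverless application with default settings.
aws emr-serverless create-application \
    --name emr-observability-demo \
    --type SPARK \
    --release-label emr-6.9.0

# Hypothetical sketch: deploy the dashboard CloudFormation template, passing the
# application ID as input. Template file name and parameter key are assumptions.
aws cloudformation create-stack \
    --stack-name emr-serverless-dashboard \
    --template-body file://emr-serverless-dashboard.yaml \
    --parameters ParameterKey=EMRServerlessApplicationId,ParameterValue=<your-application-id>
```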
Optimize resource utilization
When running Spark jobs, a common starting point is to use the default configuration. Without clear insight into resource consumption, though, optimizing the workload can be difficult. Some of the properties customers adjust most often are spark.driver.cores, spark.driver.memory, spark.executor.cores, and spark.executor.memory.
To illustrate how the newly added worker-level metrics on the CloudWatch dashboard can help you fine-tune your job configuration for better price-performance and resource utilization, let's run a sample job that performs transformations and aggregations on the dataset.
Submit the sample job with the aws emr-serverless start-job-run command, providing your Amazon S3 bucket name and the EMR Serverless application ID from the prerequisites. Make sure you use the same application ID to submit all of the sample jobs in this post.
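The exact submission command isn't reproduced here, but the following is a minimal sketch of what it could look like with default Spark settings; the script name, argument layout, and S3 paths are placeholder assumptions.

```
# Hypothetical sketch: submit the sample job with default Spark settings.
# Script name and S3 paths are placeholders, not the exact assets from this post.
aws emr-serverless start-job-run \
    --application-id <your-application-id> \
    --execution-role-arn <your-job-execution-role-arn> \
    --name sample-aggregation-default \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "s3://<your-bucket-name>/scripts/sample-job.py",
            "entryPointArguments": ["s3://<your-bucket-name>/output/"]
        }
    }'
```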
Now, let’s review the executor vCPUs and memory metrics from the CloudWatch console.
This job was run with the default EMR Serverless Spark configuration. As the metrics in the earlier screenshot show, a total of 396 vCPUs were allocated to the job (99 executors with 4 vCPUs each), but the job used at most 110 vCPUs, indicating vCPU oversubscription. Similarly, a total of 1,584 GB of memory was allocated, yet the job used only 176 GB during its run, indicating memory oversubscription as well.
Let's rerun this job with the following adjusted configuration:
| Configuration | Job 1 (default) | Job 2 (adjusted) |
| --- | --- | --- |
| spark.executor.memory | 14 GB | 3 GB |
| spark.executor.cores | 4 | 2 |
| spark.dynamicAllocation.maxExecutors | 99 | 30 |
| Total Resource Utilization | 6.521 vCPU-hours, 26.084 memoryGB-hours, 32.606 storageGB-hours | 1.739 vCPU-hours, 3.688 memoryGB-hours, 17.394 storageGB-hours |
| Billable Resource Utilization | 7.046 vCPU-hours, 28.182 memoryGB-hours, 0 storageGB-hours | 1.739 vCPU-hours, 3.688 memoryGB-hours, 0 storageGB-hours |
We use the following code:
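The original snippet isn't reproduced here; the following is a minimal sketch of the same submission with the adjusted executor settings from the preceding table, with the script name and S3 paths as placeholder assumptions.

```
# Hypothetical sketch: resubmit the job with the adjusted executor configuration.
aws emr-serverless start-job-run \
    --application-id <your-application-id> \
    --execution-role-arn <your-job-execution-role-arn> \
    --name sample-aggregation-tuned \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "s3://<your-bucket-name>/scripts/sample-job.py",
            "sparkSubmitParameters": "--conf spark.executor.memory=3g --conf spark.executor.cores=2 --conf spark.dynamicAllocation.maxExecutors=30"
        }
    }'
```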
Let's re-examine the executor metrics on the CloudWatch dashboard for this job run.
For the second job, we observe a reduction in both the allocated vCPUs (396 versus 60) and memory (1,584 GB versus 120 GB), resulting in much better utilization of resources. The original job took 4 minutes and 41 seconds to complete, while the second job took 4 minutes and 54 seconds. This reconfiguration yielded approximately 79% in cost savings without affecting job performance.
To fine-tune your job further, you can use these metrics to adjust the number of workers or the resources allocated to them, increasing or decreasing them as needed.
Diagnose and resolve job failures
Using the CloudWatch dashboard, you can quickly diagnose job failures caused by CPU, memory, or storage limits, such as running out of memory or out of disk space, and identify and resolve common errors without having to examine logs or navigate the Spark History Server. The dashboard also helps you right-size resource allocation, increasing it only as much as needed to avoid oversubscription and keep costs down.
Driver errors
In this example, the job creates a large Spark DataFrame with millions of rows, and this operation is performed on the Spark driver. When submitting the job, we also configure spark.rpc.message.maxSize, which the job requires because it serializes a large DataFrame with many columns.
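A minimal sketch of how this property could be passed at submission time follows; the script name and the maxSize value (in MiB) are assumptions for illustration.

```
# Hypothetical sketch: pass spark.rpc.message.maxSize (in MiB) for the driver-heavy job.
aws emr-serverless start-job-run \
    --application-id <your-application-id> \
    --execution-role-arn <your-job-execution-role-arn> \
    --name driver-oom-demo \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "s3://<your-bucket-name>/scripts/driver-oom-job.py",
            "sparkSubmitParameters": "--conf spark.rpc.message.maxSize=2000"
        }
    }'
```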
After a few minutes, the job failed with the error message "Encountered errors when releasing containers," as shown in the following screenshot.
This kind of error message isn't very clear, and finding the root cause usually requires deeper log analysis. But before digging into logs, let's check the CloudWatch dashboard and analyze the driver metrics, because releasing containers is largely driven by the driver.
As the metrics show, the driver's vCPU usage stayed well within its allocated limit. However, the driver was using almost all of its allocated 16 GB of memory. By default, EMR Serverless drivers are allocated 16 GB of memory.
Let's rerun the job with more memory allocated to the driver. Let's set spark.driver.memory to 27 GB as the starting point, which keeps spark.driver.memory + spark.driver.memoryOverhead under the 30 GB limit for the default worker type. spark.rpc.message.maxSize remains unchanged.
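A minimal sketch of the rerun, assuming the same hypothetical script as before: spark.driver.memory is raised to 27 GB while spark.rpc.message.maxSize stays the same.

```
# Hypothetical sketch: rerun with 27 GB of driver memory, keeping
# spark.driver.memory + spark.driver.memoryOverhead under the 30 GB worker limit.
aws emr-serverless start-job-run \
    --application-id <your-application-id> \
    --execution-role-arn <your-job-execution-role-arn> \
    --name driver-oom-demo-tuned \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "s3://<your-bucket-name>/scripts/driver-oom-job.py",
            "sparkSubmitParameters": "--conf spark.driver.memory=27g --conf spark.rpc.message.maxSize=2000"
        }
    }'
```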
The job succeeded this time. Let's check the CloudWatch dashboard to see the driver memory utilization.
The allocated memory is now 30 GB, but the driver's actual memory usage didn't exceed 21 GB during the run. This means we can optimize costs further by reducing spark.driver.memory. We reran the same job with spark.driver.memory set to 22 GB, and the job still succeeded, with better driver memory utilization.
Executor errors
Observability with CloudWatch works well for diagnosing driver-related issues, because each job has exactly one driver and the driver metrics reflect the actual resource usage of that single driver. Executor metrics, in contrast, are aggregated across all workers. Even so, you can use this dashboard to provide just enough resources to make a job succeed while avoiding oversubscription.
As a demonstration, let's simulate uniform disk over-utilization across all workers by processing a very large NOAA dataset spanning several years. This job caches a massive amount of data to disk.
After a few minutes, the job failed with a "No space left on device" error, indicating that some workers had run out of disk space.
According to the dashboard metrics, 99 executor workers were running. Each worker is provisioned with 20 GB of ephemeral storage by default.
This scenario is a common cause of Spark job failures. To confirm the root cause, let's examine the ephemeral storage allocated and used metrics on the dashboard, keeping in mind that the driver doesn't run tasks, so the disk pressure comes from the executors.
As the data shows, the 99 executors together used 1,940 GB of the 2,126 GB of ephemeral storage allocated to them. This storage includes both the data cached to disk and the data shuffled by the executors. The full 2,126 GB doesn't appear as used in this graph because some of the 99 executors may not have processed much data before the job failed and therefore used little of their storage.
Let's run the job again with a larger executor disk size by setting spark.emr-serverless.executor.disk. Let's allocate 40 GB of disk per executor as the starting configuration.
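A minimal sketch of the rerun with the larger ephemeral disk, assuming a hypothetical script name; spark.emr-serverless.executor.disk is the EMR Serverless property that controls per-worker disk size.

```
# Hypothetical sketch: rerun the disk-heavy job with 40 GB of ephemeral disk per executor.
aws emr-serverless start-job-run \
    --application-id <your-application-id> \
    --execution-role-arn <your-job-execution-role-arn> \
    --name disk-heavy-demo-tuned \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "s3://<your-bucket-name>/scripts/disk-heavy-job.py",
            "sparkSubmitParameters": "--conf spark.emr-serverless.executor.disk=40g"
        }
    }'
```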
The job ran successfully this time. Let's review the storage metrics. The aggregated allocated executor storage is now 4,251 GB because we doubled the value of spark.emr-serverless.executor.disk. Although the total aggregated executor storage doubled, the job still used only about 45% of it (1,940 GB out of 4,251 GB). This indicates that our executors were likely using only a few extra gigabytes beyond their original 20 GB each. Given that, we can try lowering spark.emr-serverless.executor.disk from 40 GB to around 25-30 GB to save on storage cost, instead of doubling it as we did in this scenario. In addition, you can use these metrics to determine whether your job is I/O-intensive and, if so, consider EMR Serverless options such as shuffle-optimized disks to improve I/O performance.
The dashboard also gives a useful overview of ephemeral storage usage in scenarios where data is cached or persisted to disk. While similar information is available in the Spark History Server, as seen in the accompanying screenshot, it becomes inaccessible once the cache is cleared or the job finishes. Because of this, the CloudWatch metric can be used for a postmortem analysis of a job that failed due to ephemeral storage issues.
In this example, the data was distributed uniformly across all executors. Because storage metrics are aggregated across all 99 executors, a data skew, where only one or two executors process a disproportionately large amount of data and run out of disk space, may not show up clearly in the dashboard's aggregated view. To diagnose such cases, you need to examine metrics for the specific executor involved. Per-worker-level metrics, covered elsewhere in this series on EMR Serverless observability, can help you identify, troubleshoot, and resolve these hard-to-detect issues.
Conclusion
In this post, you learned how to use a consolidated CloudWatch dashboard with the enriched EMR Serverless metrics to optimize and troubleshoot your EMR Serverless jobs. These metrics are available in all AWS Regions where EMR Serverless is available. Refer to the documentation to learn more about this feature.
About the Authors
is a Sr. Analytics Specialist Solutions Architect with expertise in designing scalable architectures for large-scale data processing using Amazon analytics services, including Amazon EMR, AWS Lake Formation, AWS Glue, Amazon Athena, and Amazon DataZone. With more than a decade of experience in the big data space, he has comprehensive knowledge of designing robust and scalable solutions. He provides guidance on architectural best practices and works closely with customers to craft data strategies using AWS analytics services, helping them unlock the full value of their data.
is a Principal Partner Solutions Architect and Data & AI specialist at AWS. She works with customers and partners to build robust, scalable, and secure architectures on AWS and to help them migrate their big data, analytics, and AI/ML workloads to the cloud.