Saturday, December 14, 2024

Amazon EMR on Amazon EC2 clusters make the best use of resources when they are monitored and optimized regularly. With Amazon Athena and Amazon QuickSight, you can gain detailed insight into how your cluster is being used.

Gaining fine-grained insight into application-level costs within a cluster opens up opportunities for customers seeking to maximize resource utilization, implement accurate cost allocation, and develop effective chargeback strategies. Breaking down and optimizing the usage of individual applications in your Amazon EMR cluster offers several benefits:

  • Application-level cost insights enable organizations to plan and manage their workloads strategically. Resource allocation decisions are better informed when their cost implications are fully understood, ultimately improving overall cluster efficiency and cost-effectiveness.
  • With granular cost attribution, organizations can pinpoint cost-saving opportunities for specific applications. They can right-size underutilized resources or focus optimization efforts on the applications driving the highest utilization and cost.
  • In multi-tenant environments, organizations can implement transparent pricing models that allocate costs based on each application's resource utilization and associated expenses. This promotes accountability among tenants and enables accurate chargebacks.

This post walks you through deploying a comprehensive solution in your environment to analyze Amazon EMR on EC2 cluster usage. With this solution, you gain a deep understanding of resource utilization and the associated costs of individual applications running in your EMR cluster. You can then optimize costs, implement transparent billing practices, and make informed decisions about workload management, ultimately improving the overall efficiency and cost-effectiveness of your Amazon EMR environment. The solution has primarily been tested with Spark workloads on Amazon EMR on EC2 using YARN as the resource manager. It has not been tested with other YARN-based workloads, such as Hive or Tez.

Solution overview

The solution works by running a Python script on the EMR cluster's primary node to collect metrics from the YARN resource manager and correlate them with cost details from the AWS Cost and Usage Report (CUR). The script, triggered by a cron job, sends HTTP queries to the YARN resource manager and collects metrics from two paths: /ws/v1/cluster/metrics for cluster metrics and /ws/v1/cluster/apps for application metrics. The cluster metrics contain utilization data for cluster resources, and the application metrics contain utilization data for a specific application or job. The metrics are stored in an Amazon S3 bucket.
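For illustration, the following minimal Python sketch shows the shape of such a collector. The two REST paths come from the description above; the resource manager address, bucket name, and key layout are assumptions for this example, not the repository's actual implementation.

import json
import time

import boto3
import requests

RM = "http://localhost:8088"  # YARN resource manager on the primary node (assumed address)
s3 = boto3.client("s3")

# Pull cluster-level and application-level metrics from the YARN REST API
cluster_metrics = requests.get(f"{RM}/ws/v1/cluster/metrics").json()
app_metrics = requests.get(f"{RM}/ws/v1/cluster/apps").json()

# Persist both payloads to S3 for later processing (bucket and keys are placeholders)
ts = int(time.time())
s3.put_object(Bucket="my.report.bucket", Key=f"cluster_metrics/{ts}.json", Body=json.dumps(cluster_metrics))
s3.put_object(Bucket="my.report.bucket", Key=f"app_metrics/{ts}.json", Body=json.dumps(app_metrics))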

Two YARN metrics capture the resource utilization of an application or job:

  • memorySeconds - The memory in MB allocated to an application multiplied by the number of seconds the application ran
  • vcoreSeconds - The number of YARN vcores allocated to an application multiplied by the number of seconds the application ran

The calculation uses memorySeconds to determine the running cost of an application or job. If needed, it could be modified to use vcoreSeconds instead.
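Conceptually, the attribution prorates the cluster's hourly cost by each application's share of memorySeconds in that hour. Here is a small sketch of the arithmetic with made-up numbers; the solution's actual SQL may differ in detail.

# Prorate an hourly cluster cost of $10.00 across two applications
cluster_hourly_cost = 10.00
apps_memory_seconds = {"app-1": 2_000_000, "app-2": 8_000_000}

total = sum(apps_memory_seconds.values())
for app, mem_sec in apps_memory_seconds.items():
    print(app, round(cluster_hourly_cost * mem_sec / total, 2))
# app-1 2.0
# app-2 8.0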

The YARN metrics stored in Amazon S3 are processed, consolidated, and exposed as a database with defined tables, making the data readily accessible for further analysis.

You can then write SQL queries in Amazon Athena that correlate the YARN metrics with cost details from the CUR to break down the cost of your EMR cluster by infrastructure and by application. The solution creates two such Athena views with cost breakdowns, which serve as the foundation for visualization.
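To give a feel for these queries, this hedged sketch submits a simple join through the Athena API with boto3. The database, table, and output location names are placeholders, not the solution's actual schema; only the CUR tag column (activated in the prerequisites below) and the standard CUR cost columns come from the solution's description.

import boto3

SQL = """
SELECT app.cluster_id,
       app.id,
       app.memoryseconds,
       cur.line_item_usage_start_date,
       cur.line_item_unblended_cost
FROM emr_usage_db.emr_app_metrics AS app
JOIN cur_db.cur_table AS cur
  ON cur.resource_tags_aws_elasticmapreduce_job_flow_id = app.cluster_id
LIMIT 10
"""

athena = boto3.client("athena")
resp = athena.start_query_execution(
    QueryString=SQL,
    ResultConfiguration={"OutputLocation": "s3://my.report.bucket/athena-results/"},
)
print(resp["QueryExecutionId"])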

The following diagram illustrates the solution architecture.

EMR Cluster Usage Utility Solution Architecture

Prerequisites

To implement the solution, you need the following prerequisites:

  1. Create a Cost and Usage Report (CUR) in your AWS account and an Amazon S3 bucket to store the report data. Follow the documented steps to create a CUR for your organization.

    When creating the report, make sure the following parameters are enabled:

    • Include resource IDs
    • Hourly time granularity
    • Report data integration for Athena

It can take up to 24 hours for AWS to deliver the first report to your S3 bucket. After that, the CUR is updated at least once a day.
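If you prefer to create the report programmatically, the following sketch uses the Cost and Usage Report API through boto3 with the three settings above; the report and bucket names are placeholders, and the cur API endpoint is only available in us-east-1.

import boto3

cur = boto3.client("cur", region_name="us-east-1")
cur.put_report_definition(
    ReportDefinition={
        "ReportName": "emr-usage-cur",                # placeholder name
        "TimeUnit": "HOURLY",                         # hourly time granularity
        "Format": "Parquet",                          # required for Athena integration
        "Compression": "Parquet",
        "AdditionalSchemaElements": ["RESOURCES"],    # include resource IDs
        "S3Bucket": "my-cur-bucket",                  # placeholder bucket
        "S3Prefix": "cur",
        "S3Region": "us-east-1",
        "AdditionalArtifacts": ["ATHENA"],            # report data integration for Athena
        "RefreshClosedReports": True,
        "ReportVersioning": "OVERWRITE_REPORT",       # required for Athena integration
    }
)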

  2. Integrate Athena with the CUR so you can run standard SQL queries against the report data stored in Amazon S3. AWS provides a CloudFormation template, `crawler-cfn.yml`, which is automatically generated in the same S3 bucket during CUR creation and streamlines the integration. Follow the documented steps to integrate Athena with the CUR.

    This template creates a database referencing the CUR, an event, and an AWS Glue crawler. The crawler is triggered by an S3 event notification whenever the CUR is updated, and it refreshes the AWS Glue database accordingly.

  3. Activate the AWS-generated cost allocation tag aws:elasticmapreduce:job-flow-id. This populates the field resource_tags_aws_elasticmapreduce_job_flow_id in the CUR with the EMR cluster ID, which the solution's SQL queries rely on. To activate the cost allocation tag from the management console, follow these steps:
    • Sign in to the AWS Management Console of the payer account and open the Billing console.
    • In the navigation pane, choose Cost allocation tags.
    • Under AWS-generated cost allocation tags, select the aws:elasticmapreduce:job-flow-id tag.
    • Choose Activate. It can take up to 24 hours for the tag to become active.

The following screenshot shows an example of the aws:elasticmapreduce:job-flow-id tag being activated.

CostAllocationTag
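As an alternative to the console steps above, the tag can be activated with the Cost Explorer API, as in this hedged sketch; it must run in the payer account.

import boto3

ce = boto3.client("ce")
ce.update_cost_allocation_tags_status(
    CostAllocationTagsStatus=[
        {"TagKey": "aws:elasticmapreduce:job-flow-id", "Status": "Active"}
    ]
)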

You can now try out the solution on an EMR cluster in a lab environment. If you're not familiar with Amazon EMR, follow the instructions to create a new EMR cluster and run a sample Spark job.

Deploying the solution

Deploy the solution by following the steps in the next sections.

Installing the scripts on the EMR cluster

Download the two scripts from the repository and upload them to an Amazon S3 bucket:

  • emr_usage_report.py - Python script that queries the YARN resource manager for cluster and application metrics
  • emr_install_report.sh - Bash script that sets up a cron job to run the Python script on a schedule (for example, 0 * * * * to run it hourly)

To install the scripts, add a step to the EMR cluster using the console or the aws emr add-steps AWS CLI command.

Replace the following placeholders in the command:

  • REGION with the AWS Region where the cluster is running (for example, eu-west-1 for Europe (Ireland))
  • MY-BUCKET with the name of the bucket where the scripts are stored (for example, my.artifact.bucket)
  • MY_REPORT_BUCKET with the name of the bucket where you want to collect the YARN metrics (for example, my.report.bucket)
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps "Type=CUSTOM_JAR,Name=Install YARN reporter,Jar=s3://REGION.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[s3://MY-BUCKET/emr_install_report.sh,s3://MY-BUCKET/emr_usage_report.py,MY_REPORT_BUCKET]"

You can now run Spark jobs in your EMR cluster to start generating application usage metrics.

Launching the CloudFormation stack

Once the prerequisites are satisfied and the scripts are deployed so that your EMR cluster sends YARN metrics to an S3 bucket, the rest of the solution is deployed through CloudFormation.

Before deploying the stack, upload a copy of the definitions JSON file to an Amazon S3 bucket; the CloudFormation template requires it to create the initial analysis in QuickSight. Once that's done, deploy the stack to create the remaining resources.
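For example, the file can be uploaded with boto3; the local file name, bucket, and key here are placeholders for your own values.

import boto3

boto3.client("s3").upload_file(
    "definitions.json",                # local copy of the definitions JSON file
    "my.artifact.bucket",              # bucket you pass to the stack
    "quicksight/definitions.json",     # key that forms the QSDefinitionsFile S3 URI
)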

  1. Choose Launch Stack.

This automatically deploys the AWS CloudFormation template in your AWS account. You'll be asked to confirm the stack creation; make sure you're signed in to the correct account and Region before launching.

The CloudFormation stack requires several parameters, as shown in the following screenshot.

CloudFormationStack

The following table describes the stack parameters.

Stack name A meaningful name for the stack, for example EMRUsageReport.
YARNS3BucketName Name of the S3 bucket where the YARN metrics are stored.
CURDatabaseName Name of the AWS Glue database of the Cost and Usage Report.
CURTableName Name of the AWS Glue table of the Cost and Usage Report.
EMRUsageDBName Name of the AWS Glue database to be created for the EMR usage report.
EMRInfraTableName Name of the AWS Glue table to be created for the infrastructure usage metrics.
EMRAppTableName Name of the AWS Glue table to be created for the application usage metrics.
QSUserName User name of the QuickSight user in the default namespace who will manage the EMR usage report resources.
QSDefinitionsFile S3 URI of the definitions JSON file used to create the EMR usage report analysis.
  2. Enter the parameters you prepared earlier.
  3. Choose Next.
  4. In the subsequent screens, enter any tags, AWS Identity and Access Management (IAM) roles, or stack failure options you need. Otherwise, leave the defaults.
  5. Choose Next.
  6. On the final screen, review the details and select the check boxes acknowledging that AWS CloudFormation might create IAM resources with custom names and might require the CAPABILITY_AUTO_EXPAND capability.
    CloudFormationCheckbox
  7. Choose Submit.

The stack takes a few minutes to create the remaining resources of the solution. After the stack is created, you can find a detailed summary of the created resources on the Resources tab.

Reviewing the correlation results

The CloudFormation template creates two Athena views that contain the correlated cost details of the YARN cluster and application metrics with data from the CUR. Because the CUR aggregates costs hourly, the correlation prorates the hourly running cost of the EMR cluster to derive the cost of running an individual application.

The following screenshot shows the Athena view with the cost breakdown of the YARN cluster metrics.

CorrelationResults

The following table describes the fields in the Athena view for the YARN cluster metrics.

cluster_id string ID of the cluster.
family string Resource family, used for grouping; possible values include compute instances, elastic map reduce, storage, and data transfer.
billing_start timestamp Start of the billing hour for the resource.
usage_type string Type of the resource, such as an m5.xlarge compute instance.
cost string Cost of the resource.

The following screenshot shows the cost breakdown details of the YARN application metrics in the Athena view.

CostBreakdownYARNAppMetrics

The following table describes the fields in the Athena view for the YARN application metrics.

cluster_id string ID of the cluster.
id string ID of the application run.
user string User name.
name string Name of the application.
queue string YARN queue to which the application was submitted.
finalstatus string Final status of the application.
applicationtype string Type of the application.
startedtime timestamp Start time of the application.
finishedtime timestamp Finish time of the application.
elapsed_sec double Running time of the application in seconds.
memoryseconds bigint Memory in MB allocated to the application multiplied by the number of seconds the application ran.
vcoreseconds int Number of YARN vcores allocated to the application multiplied by the number of seconds the application ran.
total_memory_mb_avg double Total memory in MB allocated to the cluster for the hour.
memory_sec_cost double Derived unit cost of memoryseconds.
application_cost double Derived cost of the application, based on memoryseconds.
total_cost double Total cost of the cluster for the hour.

Building your own visualization

The CloudFormation template creates two datasets in Amazon QuickSight that use the Athena views as data sources, as well as a sample analysis. The sample analysis consists of two sheets, EMR Infra Spend and EMR App Spend, with a preconfigured bar chart and pivot tables. You can use the datasets to create your own visualizations presenting the cost details of your EMR clusters.

The EMR Infra Spend sheet references the dataset of the YARN cluster metrics. It provides filters for date range and cluster ID selection. The bar chart shows a breakdown of resource costs by cluster over time, and the pivot table lets you drill down into the daily cost of each individual resource.

The following screenshot shows the EMR Infra Spend sheet from the sample analysis created by the CloudFormation template.

The EMR App Spend sheet references the dataset of the YARN application metrics. It also provides filters for date range and cluster ID selection. The pivot table on this sheet illustrates how to use the fields in the dataset to present a breakdown of cost details by user: the applications they ran, their final status, their duration, and the derived cost of each run.

The following screenshot shows the EMR App Spend sheet from the sample analysis created by the CloudFormation template.

Cleanup

To avoid incurring future charges, delete the resources you created by following these steps:

  1. On the AWS CloudFormation console, delete the stack created with the template.
  2. Terminate the EMR cluster.
  3. Delete the S3 bucket used for the YARN metrics: on the Amazon S3 console, select the bucket, empty it, and then delete it.
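If you prefer to script the cleanup, the following sketch performs the same three steps with boto3; the stack name, cluster ID, and bucket name are placeholders for your own values.

import boto3

# 1. Delete the CloudFormation stack
boto3.client("cloudformation").delete_stack(StackName="EMRUsageReport")

# 2. Terminate the EMR cluster
boto3.client("emr").terminate_job_flows(JobFlowIds=["j-XXXXXXXXXXXXX"])

# 3. Empty and delete the S3 bucket used for the YARN metrics
bucket = boto3.resource("s3").Bucket("my.report.bucket")
bucket.objects.all().delete()
bucket.delete()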

Conclusion

This post showed how to build a comprehensive cluster usage reporting solution that provides granular insights into the resource consumption and associated costs of individual applications running in your Amazon EMR on EC2 cluster. By using Athena and QuickSight to correlate YARN metrics with cost details from the Cost and Usage Report, you can make data-driven decisions: optimize resource utilization, introduce transparent and accurate billing models aligned with actual usage patterns, and improve the cost-effectiveness of your Amazon EMR environment. These insights help you get the most out of your EMR cluster, continuously improving your data processing and analytics workflows while maximizing return on investment.


About the authors

is a Senior Technical Account Manager at Amazon Web Services (AWS). He works closely with Enterprise Support customers, providing guidance and technical expertise to help them design and operate AWS environments following best practices. Based in Singapore, Boon Lee has over 20 years of experience in the IT and telecom industries.

is a Sr. Analytics Specialist Solutions Architect at Amazon Web Services (AWS) in the Philippines, specializing in big data and analytics. She helps customers design and build secure, scalable, and cost-effective data solutions and migrate and optimize their big data and analytics workloads on AWS. She is passionate about helping organizations unlock the full value of their data.

is the Head of Data & AI Solution Architecture for ASEAN at Amazon Web Services (AWS). With more than 15 years of experience in data and AI, he helps drive innovation and growth in this space. Vikas is passionate about helping customers succeed in their digital transformation journeys with cloud-based solutions and emerging technologies.

is a Big Data Solutions Architect at Amazon Web Services. He is passionate about distributed systems, open source technologies, and security. He works with customers around the globe to design, build, and optimize secure and scalable data pipelines with Amazon EMR.
