This post highlights the key lessons learned while helping a global financial services provider migrate their Apache Hadoop clusters to Amazon Web Services (AWS), along with the best practices that enabled them to reduce their Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3) costs by over 30% per month.
Working closely with their DevOps teams, we established and implemented cost-optimization strategies and operational best practices. We also describe a data-driven approach centered on a cost-optimization hackathon for their Apache Spark workloads, along with configuration tuning for Apache HBase.
Background
In early 2022, a business unit of a global financial services provider began migrating its customer-facing applications to Amazon Web Services (AWS). These included web applications backed by Apache HBase data stores, search clusters running Apache Solr, and data processing systems built on Apache Hadoop. The migration moved over 150 server nodes and more than 1 PB of data while maintaining continuity of operations. The on-premises clusters supported both real-time data ingestion and batch processing.
Because of the accelerated timeline driven by the impending closure of their data centers, the organization used a lift-and-shift strategy to rehost their Apache Hadoop clusters on Amazon EMR on Amazon Elastic Compute Cloud (EC2) instances.
Amazon EMR on EC2 gives enterprises the flexibility to run their applications with minimal modifications on managed Hadoop clusters that come with Spark, Hive, and HBase pre-installed, making big data workloads straightforward to deploy and scale. Because cluster management is handled by the service, organizations can break their on-premises infrastructure into tailored, use-case-specific transient and persistent clusters on AWS without taking on additional operational burden.
Problem
While the lift-and-shift strategy made the migration low risk for the business unit and let engineers stay focused on improving their products, it came with an increase in recurring AWS costs.
The business unit ran both transient and persistent clusters to support diverse use cases across the organization. Several key components relied heavily on Spark Streaming to process data in real time, taking advantage of the scalability and reliability of persistent clusters. They also ran HBase on persistent clusters.
After the initial deployment, several configuration issues surfaced, resulting in reduced efficiency and higher costs. Although the persistent clusters used Amazon EMR managed scaling, the configuration was not cost-efficient: the minimum cluster size was set to 40 nodes, which wasted resources. The core nodes had also been inadvertently configured to scale automatically, so scale-in events could terminate core nodes that held shuffle data. Because of knowledge gaps around running Spark applications on Amazon EMR on EC2, jobs consistently ran roughly five times longer than intended, and auto-termination policies never flagged those clusters as idle because a job was still running.
Separate environments had been established for development (dev), user acceptance testing (UAT), and production (prod). These environments were over-provisioned with large baseline capacity and overly generous managed scaling policies, which ultimately led to higher costs, as illustrated below.
Short-term cost-optimization strategy
Within four months, the business unit completed the complex task of migrating applications, databases, and Hadoop clusters. Its immediate goal was to exit its data centers quickly, and then to optimize costs and streamline operations. Although higher upfront costs were expected because of the lift-and-shift approach, costs ended up exceeding projections by roughly 40%. The team then needed to identify what to optimize.
The business unit worked closely with its own teams and with AWS to design a cost-reduction strategy. They focused first on cost-optimization best practices that did not require involvement from product development teams or affect their productivity. A thorough cost analysis revealed that the most significant contributors were: EMR on EC2 clusters running Spark, EMR on EC2 clusters running HBase, Amazon S3 storage, and EC2 instances running Solr.
The business unit started by automating the shutdown of EMR clusters in its development environments. Using Amazon EMR's IsIdle Amazon CloudWatch metric, they built an event-driven shutdown workflow. They also adopted a stricter rule that terminates development clusters after three hours, regardless of utilization. In the dev and UAT environments, they updated the managed scaling policies, lowering the minimum cluster size so that clusters scale up only on demand. These changes cut recurring dev and UAT costs by 60% over a five-month period, as shown below.
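The post does not include the team's actual shutdown automation, so the following is only a minimal sketch of how such a check might look, assuming a scheduled job (for example, a cron task or Lambda-style function) using boto3. The thresholds and the combination of the IsIdle metric with a three-hour age limit follow the description above; everything else is illustrative.

```python
"""Minimal sketch: terminate dev EMR clusters that report idle, or that exceed
a maximum age, assuming a scheduled job with boto3. Thresholds and the overall
structure are illustrative, not taken from the original post."""
from datetime import datetime, timedelta, timezone

import boto3

emr = boto3.client("emr")
cloudwatch = boto3.client("cloudwatch")

MAX_AGE = timedelta(hours=3)  # stricter rule from the post: reap after 3 hours


def cluster_is_idle(cluster_id: str) -> bool:
    """Check the EMR IsIdle CloudWatch metric over the last 30 minutes."""
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/ElasticMapReduce",
        MetricName="IsIdle",
        Dimensions=[{"Name": "JobFlowId", "Value": cluster_id}],
        StartTime=datetime.now(timezone.utc) - timedelta(minutes=30),
        EndTime=datetime.now(timezone.utc),
        Period=300,
        Statistics=["Minimum"],
    )
    datapoints = stats["Datapoints"]
    # Treat the cluster as idle only if every datapoint in the window is 1 (idle).
    return bool(datapoints) and all(dp["Minimum"] == 1.0 for dp in datapoints)


def reap_dev_clusters() -> None:
    """Terminate clusters that are idle or older than MAX_AGE."""
    clusters = emr.list_clusters(ClusterStates=["WAITING", "RUNNING"])["Clusters"]
    for cluster in clusters:
        age = datetime.now(timezone.utc) - cluster["Status"]["Timeline"]["CreationDateTime"]
        if cluster_is_idle(cluster["Id"]) or age > MAX_AGE:
            emr.terminate_job_flows(JobFlowIds=[cluster["Id"]])


if __name__ == "__main__":
    reap_dev_clusters()
```

In practice this would run only against development accounts or clusters tagged as dev, so production clusters are never touched.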
For the initial production deployment, a subset of Spark jobs ran on a persistent cluster using an older Amazon EMR 5.x release. To improve price performance, they split the workloads into small and large jobs and assigned each group to its own persistent cluster, configuring the minimum number of core nodes needed to run the jobs on each cluster. Sizing the core nodes statically and using managed scaling only for task nodes avoids losing shuffle data. It also allows faster scale-out and scale-in, because task nodes do not store data in the Hadoop Distributed File System (HDFS) and therefore carry fewer dependencies.
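One way to express this pattern is a managed scaling policy that caps core capacity at the fixed size the jobs need, so any additional capacity comes from task nodes. The sketch below uses the boto3 put_managed_scaling_policy API; the cluster ID and unit counts are illustrative, not the team's actual values.

```python
"""Minimal sketch: pin core capacity and let managed scaling add or remove only
task nodes. Cluster ID and unit counts are illustrative placeholders."""
import boto3

emr = boto3.client("emr")

emr.put_managed_scaling_policy(
    ClusterId="j-EXAMPLECLUSTER",  # hypothetical cluster ID
    ManagedScalingPolicy={
        "ComputeLimits": {
            "UnitType": "Instances",
            # Minimum capacity matches the fixed core fleet the jobs require...
            "MinimumCapacityUnits": 3,
            "MaximumCapacityUnits": 20,
            # ...and the core capacity cap keeps scaling on task nodes only,
            # so shuffle data on core nodes is never lost to a scale-in.
            "MaximumCoreCapacityUnits": 3,
            "MaximumOnDemandCapacityUnits": 20,
        }
    },
)
```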
The Solr clusters run directly on EC2 instances. To identify the most suitable EC2 configuration for that workload, the team ran thorough performance evaluations.
Amazon S3 was another significant cost driver: with more than 1 PB of data stored, it accounted for over 15% of monthly expenses. By moving data to a lower-cost storage class, the business unit reduced its monthly Amazon S3 costs by over 40%, as illustrated below. They also migrated their Amazon EBS volumes from gp2 to gp3.
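The post does not name the storage class the team moved to, so the following is only an illustration of the mechanism: a lifecycle rule that transitions objects to S3 Intelligent-Tiering, one common choice for data with uneven access patterns. The bucket name and prefix are hypothetical.

```python
"""Minimal sketch: apply a lifecycle rule that moves objects to a lower-cost
storage class. The post does not name the class the team chose; Intelligent-
Tiering is shown only as an example. Bucket and prefix are hypothetical."""
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-analytics-bucket",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-historical-data",
                "Status": "Enabled",
                "Filter": {"Prefix": "warehouse/"},  # hypothetical prefix
                "Transitions": [
                    {"Days": 30, "StorageClass": "INTELLIGENT_TIERING"}
                ],
            }
        ]
    },
)
```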
Long-term cost-optimization strategy
After realizing these initial savings, the business unit worked with the AWS team to organize a financial hackathon, FinHack. The objective of the hackathon was to use a data-driven process to further reduce the cost of their Spark applications. To prepare, they identified a set of jobs to run against various Amazon EMR deployment options and configurations, such as Spot Instances, AWS Graviton, managed scaling, and EC2 instance fleets, to determine the most cost-effective option for each job. A sample test plan for one job is shown below. AWS team members also helped analyze Spark configurations and job execution during the event.
| Job | Test | Description | Configuration |
| --- | --- | --- | --- |
| Job 1 | 1 | Submit the job as a JAR with spark-submit on EMR on EC2, using default Spark configuration (executor cores, memory, and partitions). | Non-Graviton, On-Demand Instances |
| | 2 | Run the same job on EMR Serverless with default Spark configuration. | Default configuration |
| | 3 | Run the job on EMR on EC2 with Graviton-based instances and default Spark configuration. | Graviton, On-Demand Instances |
| | 4 | Run the job on EMR on EC2 with Graviton-based instances, default Spark configuration, and hybrid Spot Instance allocation. | Graviton, On-Demand and Spot Instances |
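As a rough illustration of test 1, the baseline run could be submitted as an EMR step that calls spark-submit with no extra Spark configuration. The sketch below uses boto3; the cluster ID, JAR path, and class name are hypothetical placeholders, not details from the post.

```python
"""Minimal sketch of test 1: submit a Spark JAR as an EMR step with default
Spark configuration. Cluster ID, JAR location, and class name are hypothetical."""
import boto3

emr = boto3.client("emr")

emr.add_job_flow_steps(
    JobFlowId="j-EXAMPLECLUSTER",  # hypothetical cluster ID
    Steps=[
        {
            "Name": "baseline-spark-job",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                # command-runner.jar lets a step invoke spark-submit on the cluster.
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "--deploy-mode", "cluster",
                    "--class", "com.example.BaselineJob",      # hypothetical class
                    "s3://example-bucket/jars/baseline-job.jar",  # hypothetical JAR
                ],
            },
        }
    ],
)
```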
During the FinHack, the business unit tested jobs with and without Spot Instances. They first designed tailored instance fleet compositions to optimize resource allocation. To further optimize job execution, they automated a query of Spot placement scores via the AWS API before launching new jobs, so each job launched in the most suitable Availability Zone.
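The sketch below shows one way such a pre-launch check could look, using the EC2 GetSpotPlacementScores API through boto3 and picking the best-scoring Availability Zone. The instance types, target capacity, and Region are assumptions for illustration only.

```python
"""Minimal sketch: query Spot placement scores and pick the best-scoring
Availability Zone before launching a cluster. Instance types, capacity, and
Region are illustrative assumptions."""
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.get_spot_placement_scores(
    InstanceTypes=["r6g.4xlarge", "r6g.8xlarge"],  # hypothetical fleet types
    TargetCapacity=20,
    TargetCapacityUnitType="units",
    SingleAvailabilityZone=True,  # score individual AZs rather than whole Regions
    RegionNames=["us-east-1"],
)

# Scores range from 1 to 10; higher means Spot capacity is more likely available.
best = max(response["SpotPlacementScores"], key=lambda s: s["Score"])
print(f"Launch in {best['AvailabilityZoneId']} (score {best['Score']}/10)")
```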
The team also developed an EMR job monitoring script that provides per-job cost visibility and enables continuous measurement. They used a database to track the status of all transient clusters in their accounts and generated hourly reports for each job.
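The post does not describe the monitoring script or the database behind it, so the following only illustrates the kind of data such an hourly report could collect: recently created clusters, their state, and normalized instance hours as a rough proxy for relative cost.

```python
"""Minimal sketch of an hourly cluster report: list clusters created in the
last hour and record their status and normalized instance hours (a rough proxy
for relative cost). The team's actual script and data store are not described
in the post; this is illustrative only."""
from datetime import datetime, timedelta, timezone

import boto3

emr = boto3.client("emr")


def hourly_report() -> list[dict]:
    since = datetime.now(timezone.utc) - timedelta(hours=1)
    rows = []
    paginator = emr.get_paginator("list_clusters")
    for page in paginator.paginate(CreatedAfter=since):
        for cluster in page["Clusters"]:
            rows.append(
                {
                    "cluster_id": cluster["Id"],
                    "name": cluster["Name"],
                    "state": cluster["Status"]["State"],
                    "normalized_instance_hours": cluster["NormalizedInstanceHours"],
                }
            )
    return rows


if __name__ == "__main__":
    for row in hourly_report():
        print(row)
```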
Executing the test plan surfaced several additional optimization opportunities:
- Some of the tests make API calls to the Solr clusters, which risked creating a performance bottleneck. To keep Spark jobs from overloading those clusters, they tuned the `executor.cores` and `spark.dynamicAllocation.maxExecutors` properties.
- Task nodes were provisioned with excessively large Amazon Elastic Block Store (EBS) volumes. Reducing them to 100 GB yielded additional cost savings.
- They updated their instance fleet configuration so that units and weights are set proportionally to the instance types selected.
- During the initial migration, teams had set the `spark.sql.shuffle.partitions` configuration too high. The value had been tuned for their on-premises cluster and no longer matched their Amazon EMR clusters. They tuned it to one to two times the number of vCores in the cluster, as sketched below.
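The sketch below pulls these tuned properties together in a PySpark session. The vCore count and the caps on executors and cores are illustrative assumptions; the post does not give the team's actual values.

```python
"""Minimal sketch of the tuned Spark properties, assuming a cluster with
roughly 160 vCores; the actual values the team used are not given in the post."""
from pyspark.sql import SparkSession

CLUSTER_VCORES = 160  # hypothetical cluster size

spark = (
    SparkSession.builder.appName("tuned-job")
    # One to two times the cluster's vCores, per the tuning described above.
    .config("spark.sql.shuffle.partitions", str(CLUSTER_VCORES * 2))
    # Cap dynamic allocation so Spark jobs cannot overload the Solr clusters.
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.maxExecutors", "30")
    .config("spark.executor.cores", "4")
    .getOrCreate()
)
```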
Following the FinHack, the organization implemented a cost allocation tagging strategy to distinguish persistent clusters, which are provisioned with Terraform, from transient clusters, which are launched with Amazon Managed Workflows for Apache Airflow (MWAA). They also deployed cost monitoring dashboards.
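A tagging strategy like this only works if every cluster is tagged consistently at launch. The sketch below shows cost allocation tags applied to a cluster with boto3; the tag keys, values, and cluster ID are hypothetical examples, not the organization's actual scheme.

```python
"""Minimal sketch: apply cost allocation tags to an EMR cluster so persistent
(Terraform-provisioned) and transient (MWAA-launched) clusters can be separated
in cost reports. Tag keys, values, and cluster ID are hypothetical."""
import boto3

emr = boto3.client("emr")

emr.add_tags(
    ResourceId="j-EXAMPLECLUSTER",  # hypothetical cluster ID
    Tags=[
        {"Key": "lifecycle", "Value": "transient"},        # or "persistent"
        {"Key": "provisioned-by", "Value": "mwaa"},         # or "terraform"
        {"Key": "cost-center", "Value": "analytics-team"},  # hypothetical
    ],
)
```

Once these tags are activated as cost allocation tags, the two cluster populations can be broken out separately in cost reports and dashboards.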
Outcomes
The business unit reduced its month-over-month costs by 30 percent within three months, which enabled the team to continue migrating their remaining on-premises workloads. Most of their roughly 2,000 monthly jobs now run on transient EMR clusters. They have also accelerated adoption of AWS Graviton processors, which now account for 40% of total usage hours each month, with an additional 10% in non-production environments.
Conclusion
Through a data-driven approach that combined cost analysis, AWS best practices, configuration optimization, and rigorous testing during the FinHack, the global financial services provider reduced its AWS costs by 30% in three months. The organization implemented auto-termination policies, right-sized its managed scaling configurations, adopted Spot Instances for cost-effective compute, and moved workloads to AWS Graviton instances. Fine-tuned Spark and HBase configurations further reduced costs, while cost allocation tagging and monitoring dashboards provided ongoing visibility into spend. By partnering with AWS and prioritizing best practices, they were able to advance their cloud migration while optimizing costs for their big data workloads on Amazon EMR.
To achieve similar results for your own workloads, we recommend exploring the cost-optimization best practices described in this post.
With over 20 years of experience in IT, he serves as a Senior Solutions Architect at Amazon Web Services in Southern California. He is passionate about empowering businesses to unlock value through innovative technology solutions. Outside of work, he enjoys climbing and spending quality time with his family.
As a seasoned AWS Specialist Solutions Architect focused on analytics, he is passionate about helping customers unlock valuable insights from their data, crafting solutions that equip organizations to make informed, data-driven decisions. Navnit Shukla is also the author of a book and runs a popular YouTube channel where he shares in-depth insights on cloud technologies and analytics. Connect with him on .