Wednesday, April 2, 2025

How we streamlined log management at scale with Amazon OpenSearch Ingestion

As a leading global provider of credit scoring solutions, we deliver credit risk assessment, fraud detection, targeted marketing tools, and automated decision-making capabilities to clients worldwide. As an early adopter of Amazon Web Services (AWS), we have proactively leveraged cloud computing to drive our digital transformation initiatives. Our Cloud Center of Excellence (CCoE) global team manages a worldwide AWS Landing Zone, including a centralized AWS network infrastructure. As an AWS Partner, we are pleased to be recognized for our work with AWS PrivateLink, offering our clients seamless access to a diverse range of products through secure, private, and high-performance connectivity options.

Our network ingress platform leverages a suite of AWS services, including ALB, NLB, GWLB, Route 53, API Gateway, and CloudWatch, in conjunction with third-party security appliances, to deliver a robust and scalable solution. The variety of services and sources, coupled with the substantial volume of network traffic across the platform, generates a large number of logs that we need to consolidate and manage efficiently so our operations teams can analyze them quickly when troubleshooting the platform.

We selected this design for its ability to rapidly retrieve specific log entries from vast datasets within seconds. To further enhance our analytics capabilities, we introduced additional preprocessing steps that apply multiple filters to cleanse and enrich the data before indexing it in the OpenSearch cluster, providing a more comprehensive and informative monitoring experience.

In this post, we reflect on the path we’ve taken, highlighting the obstacles we faced, the alternatives we evaluated, and the reasoning behind our decisions as we streamlined our log management process.

Overview of the initial solution

We planned to store and process log data in an OpenSearch cluster, and ultimately decided to use the managed offering, Amazon OpenSearch Service. To collect and ship the logs, we wanted to use Logstash, but because there was no managed AWS service for it, we had to deploy Logstash on an Amazon Elastic Compute Cloud (Amazon EC2) instance. Maintaining this setup meant performing regular server upkeep, deploying updated Logstash configurations, and restarting the service as needed. To keep the setup performant and secure, we also had to patch and update the operating system (OS) and the Logstash utility, while closely monitoring server resources including Java heap, CPU usage, memory allocation, and storage capacity.
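
To give a sense of what this stage involved, here is a minimal sketch, assuming a space-delimited log format and a placeholder endpoint, of the parse-and-ship work the Logstash instance performed, expressed in Python with the opensearch-py client rather than Logstash’s own configuration language:

```python
# Hypothetical sketch of the Logstash stage: parse log lines and bulk-index
# them into OpenSearch. Endpoint, index, and log format are placeholders;
# authentication (SigV4/basic) is omitted for brevity.
from opensearchpy import OpenSearch, helpers

client = OpenSearch(hosts=["https://search-example-domain.us-east-1.es.amazonaws.com"])

def parse(line: str) -> dict:
    # Stand-in for Logstash filters: split a space-delimited log line.
    ts, level, message = line.split(" ", 2)
    return {"@timestamp": ts, "level": level, "message": message}

with open("platform.log") as f:
    actions = (
        {"_index": "platform-logs", "_source": parse(line.rstrip("\n"))}
        for line in f
    )
    helpers.bulk(client, actions)
```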

Validating the network path from the Logstash server to the OpenSearch cluster was a protracted process, requiring careful examination of Access Control Lists (ACLs), security team approvals, and VPC subnet routing configurations. Scaling beyond a single EC2 instance, we encountered challenges in managing our auto scaling group, Amazon Simple Queue Service (Amazon SQS) queues, and other related components. Keeping this solution running consistently demanded a significant investment of time and energy, pulling focus away from our core responsibilities of managing and monitoring the platform.

The following diagram outlines our initial architecture.

Options we considered

Our team evaluated several options for managing the logs from this platform. We evaluated Splunk as a potential alternative to Amazon OpenSearch Service, considering its log storage and analysis capabilities, but ultimately chose OpenSearch Service for several compelling reasons:

  • Our team has a greater familiarity with OpenSearch Service and Logstash compared to Splunk.
  • By using Amazon OpenSearch Service, a fully managed offering within AWS, we can keep log shipping simpler than with our previous reliance on on-premises Splunk. Transferring logs to an on-premises Splunk cluster would not only incur significant costs but also consume network bandwidth and add unnecessary complexity to our systems.
  • Splunk’s pricing structure, mainly driven by storage costs per gigabyte, became economically unfeasible due to the enormous volume of log data we needed to store and process.

An OpenSearch ingestion pipeline requires upfront design to ensure effective data processing and analysis. A well-structured pipeline must consider the following key components:

Data Sources – Identify the primary sources of data, such as Apache logs or IoT devices, and determine the format of the data.
Transformations – Determine the necessary transformations required to normalize the data, including converting data types, removing duplicates, and aggregating data.
Ingestion Process – Define the ingestion process, which involves collecting data from various sources, transforming it according to requirements, and loading it into OpenSearch for analysis and visualization.

Implementing an OpenSearch ingestion pipeline requires careful planning and execution to integrate smoothly with existing systems; a minimal sketch of such a pipeline follows.
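
To make this concrete, the following is a minimal sketch, not our production setup, of defining and creating an OpenSearch Ingestion (Data Prepper) pipeline with the boto3 osis client. The queue URL, role ARN, domain endpoint, index name, and grok pattern are all placeholder assumptions:

```python
# Hypothetical sketch: creating an OpenSearch Ingestion pipeline with boto3.
# All names, ARNs, and endpoints below are placeholders.
import boto3

# Data Prepper pipeline definition: read ALB access logs from S3 (driven by
# SQS notifications), parse them with grok, and index them into OpenSearch.
pipeline_body = """
version: "2"
alb-log-pipeline:
  source:
    s3:
      notification_type: "sqs"
      compression: "gzip"
      codec:
        newline:
      sqs:
        queue_url: "https://sqs.us-east-1.amazonaws.com/111122223333/alb-log-queue"
      aws:
        region: "us-east-1"
        sts_role_arn: "arn:aws:iam::111122223333:role/osis-pipeline-role"
  processor:
    - grok:
        match:
          message: ["%{NOTSPACE:type} %{TIMESTAMP_ISO8601:time} %{NOTSPACE:elb}"]
  sink:
    - opensearch:
        hosts: ["https://search-example-domain.us-east-1.es.amazonaws.com"]
        index: "alb-logs-%{yyyy.MM.dd}"
        aws:
          region: "us-east-1"
          sts_role_arn: "arn:aws:iam::111122223333:role/osis-pipeline-role"
"""

osis = boto3.client("osis")
response = osis.create_pipeline(
    PipelineName="alb-log-pipeline",
    MinUnits=1,  # the pipeline scales between MinUnits and MaxUnits OCUs
    MaxUnits=4,
    PipelineConfigurationBody=pipeline_body,
)
print(response["Pipeline"]["Status"])
```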

The AWS team approached us to discuss two new features they were launching. The first, Amazon OpenSearch Ingestion, delivered a seamless solution to the challenges we faced in managing EC2 instances for Logstash. It significantly streamlined our team’s workload by removing the need to manage multiple EC2 instances, scaling capacity up or down in response to traffic fluctuations, and monitoring log consumption and server performance metrics for us.

Second, Amazon OpenSearch Ingestion pipelines supported almost all of the Logstash filters we had been using in our existing solution, enabling us to enrich logs in the same way as before; a hypothetical mapping is sketched below.
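
As an illustration of that parity, here is a hypothetical fragment, not our actual configuration, mapping three common Logstash filters onto their Data Prepper processor counterparts:

```python
# Hypothetical fragment for a pipeline's processor section, showing Data
# Prepper equivalents of common Logstash filters. Fields are placeholders.
processor_fragment = """
processor:
  - grok:                 # Logstash 'grok' filter equivalent
      match:
        message: ['%{IP:client_ip} %{NUMBER:status_code}']
  - date:                 # Logstash 'date' filter equivalent
      match:
        - key: timestamp
          patterns: ['yyyy-MM-dd HH:mm:ss']
      destination: '@timestamp'
  - drop_events:          # Logstash 'drop' filter equivalent
      drop_when: '/status_code == "200"'
"""
```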

We were thrilled to be chosen for the AWS beta program, which put us among the first large-scale adopters. We started by parsing VPC flow logs for our web ingress platform, along with the Transit Gateway flow logs that interconnect all VPCs in our AWS environment. At more than 14 terabytes of logs daily, the sheer volume of Transit Gateway flow logs presented a substantial workload. As our scope grew to cover additional log types, including ALB and NLB access logs as well as AWS WAF logs, the complexity of our solution increased accordingly, ultimately driving up costs.

The difficulties we faced at first did much to temper our initial optimism. Despite our best efforts, we struggled to achieve acceptable performance. Working with the AWS team, we discovered and addressed misconfigurations in our infrastructure. Our OpenSearch instances were undersized for the volume of data, so the cluster operated at maximum capacity and incoming log data backed up. This backpressure caused our OpenSearch Ingestion pipelines to scale out unexpectedly as the underlying cluster struggled to keep pace with demand.
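
Two standard OpenSearch APIs were useful for spotting this kind of backpressure. A minimal sketch, assuming a placeholder endpoint and omitting authentication:

```python
# Hypothetical sketch: diagnosing indexing backpressure on an OpenSearch
# cluster. Endpoint is a placeholder; auth (SigV4/basic) omitted for brevity.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=["https://search-example-domain.us-east-1.es.amazonaws.com"])

# Cluster status: yellow/red points to an unhealthy or undersized cluster.
print(client.cluster.health())

# Rejections in the write thread pool are a classic sign that the cluster
# cannot keep up with the incoming indexing load from the pipelines.
print(client.cat.thread_pool(thread_pool_patterns="write", v=True))
```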

These underlying cluster issues resulted in subpar performance. We found ourselves unable to access and review log data in real time, with delays often extending several days past ingestion. The costs stemming from these inefficiencies also significantly surpassed our initial projections.

Despite the challenges, the AWS team’s expertise enabled us to address these concerns, streamlining our infrastructure for better performance and lower costs. The experience highlighted the pivotal role of correct configuration and close collaboration in unlocking the full potential of AWS services, ultimately yielding a more effective outcome for our data ingestion workflows.

Optimizing the OpenSearch Ingestion pipelines

With optimized design patterns and configuration best practices, you can process and enrich your data streams efficiently while reducing latency and downtime.

In partnership with AWS, we leveraged their expertise to develop a solution that not only performed well but also delivered cost savings and integrated seamlessly with our existing monitoring infrastructure. The solution uses Amazon S3 Select in the pipeline source to selectively ingest specific log fields, and further selective ingestion is possible with the include_keys and exclude_keys options in the pipeline configuration. We also used the service’s built-in Index State Management (ISM) capability to delete logs older than a predetermined retention period, reducing the overall cost of the cluster.
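
On the retention side, here is a minimal sketch of an ISM policy that deletes log indexes after a fixed age; the policy name, index pattern, and 14-day retention are placeholder assumptions:

```python
# Hypothetical sketch: ISM policy that deletes old log indexes. Endpoint,
# policy name, index pattern, and retention are placeholders; auth omitted.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=["https://search-example-domain.us-east-1.es.amazonaws.com"])

policy = {
    "policy": {
        "description": "Delete log indexes after 14 days",
        "default_state": "hot",
        "states": [
            {
                "name": "hot",
                "actions": [],
                "transitions": [
                    {"state_name": "delete", "conditions": {"min_index_age": "14d"}}
                ],
            },
            {"name": "delete", "actions": [{"delete": {}}], "transitions": []},
        ],
        "ism_template": [{"index_patterns": ["alb-logs-*"], "priority": 100}],
    }
}

client.transport.perform_request(
    "PUT", "/_plugins/_ism/policies/delete-old-logs", body=policy
)
```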

Ingesting logs into Amazon OpenSearch Service enables us to combine disparate data sets, yielding valuable insights into behavior and trends across the entire ecosystem. To support comprehensive analysis of combined logs alongside all native log fields, we use table partitioning to efficiently query the Parquet-formatted logs stored in Amazon S3, reducing costs and improving performance.
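
One way to run those queries is Amazon Athena (an assumption on our part; any engine that reads partitioned Parquet from S3 would do). A sketch with placeholder database, table, and result-bucket names:

```python
# Hypothetical sketch: querying partitioned Parquet logs in S3 with Athena.
# Database, table, partition columns, and bucket names are placeholders.
import boto3

athena = boto3.client("athena")

query = """
SELECT client_ip, elb_status_code, COUNT(*) AS requests
FROM platform_logs.alb_logs
WHERE year = '2025' AND month = '04' AND day = '02'  -- partition pruning
GROUP BY client_ip, elb_status_code
ORDER BY requests DESC
LIMIT 20
"""

response = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "platform_logs"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print(response["QueryExecutionId"])
```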

This solution significantly boosts our platform’s observability, substantially reduces the cost of processing large volumes of logs, and accelerates root cause identification during platform incident troubleshooting.

The following diagram illustrates our optimized architecture.

Performance comparison

The following comparison covers the initial design using Logstash on Amazon EC2, the first OpenSearch Ingestion pipeline configuration, and the optimized OpenSearch Ingestion pipeline configuration.

Maintenance effort

Logstash on Amazon EC2 – Our team had to devote considerable time to managing numerous services and instances, diverting attention from effectively managing and monitoring the platform.
Initial OpenSearch Ingestion pipeline – OpenSearch Ingestion handled most of the behind-the-scenes complexity, freeing the team to focus on maintaining the single configuration file that governs the ingestion pipeline.
Optimized OpenSearch Ingestion pipeline – OpenSearch Ingestion continues to handle the heavy lifting, with the team maintaining and updating the ingestion pipeline configuration files.

Performance

Logstash on Amazon EC2 – EC2 instances running Logstash scale up or down on demand within an auto scaling group.
Initial OpenSearch Ingestion pipeline – With insufficient resources on the OpenSearch cluster, the ingestion pipelines constantly ran at their maximum OpenSearch Compute Units (OCUs), significantly delaying log delivery.
Optimized OpenSearch Ingestion pipeline – Ingestion pipelines scale their OCUs up or down as needed.

Real-time log availability

Logstash on Amazon EC2 – Processing and delivering the wide range of log types in Amazon S3 required a significant number of EC2 instances; to minimize costs, we ran fewer instances, which delayed log delivery to OpenSearch.
Initial OpenSearch Ingestion pipeline – With the pipelines constantly at maximum OCUs, log delivery was delayed by several days.
Optimized OpenSearch Ingestion pipeline – Logs reach OpenSearch for near-real-time analysis.

Cost savings

Logstash on Amazon EC2 – Operating multiple services and instances to ship logs to OpenSearch significantly increased overall cost.
Initial OpenSearch Ingestion pipeline – With the pipelines constantly running at maximum OCUs, the cost of the service increased.
Optimized OpenSearch Ingestion pipeline – The pipeline’s OCUs scale flexibly with demand, keeping overall cost low.

Overall benefit

Conclusion

In this post, we documented our experience building a solution with OpenSearch Service and OpenSearch Ingestion pipelines. By decoupling log analysis from log shipping, this solution lets us focus on processing and analyzing our platform’s logs without the logistical complexity of sending them to OpenSearch. We also emphasized the need to optimize the service to improve efficiency and reduce costs.

As a next step, we plan to explore newly introduced features in OpenSearch Service, aiming to further reduce the solution’s cost and improve flexibility by fine-tuning the timing and variety of log ingestion.


About the Authors

Navnit Shukla serves as an AWS Specialist Solutions Architect focusing on data-driven solutions. With a passion for empowering customers, he excels at uncovering valuable insights from data and develops solutions that enable organizations to make informed, data-backed decisions. Navnit authored the book “Data Wrangling on AWS”. Interested readers can connect with him through [insert contact method].

Serves as Senior Principal Cloud Platform Network Architect at a prominent multinational financial services corporation specializing in credit scoring and reporting. With over 16 years of experience in both on-premises and cloud networking, he is passionate about designing innovative cloud-based solutions that meet customers’ needs and solve complex problems. Outside of work, he enjoys spending time with his family and trips to the mountains of Colorado.
