Monday, August 25, 2025

Zeta reduces banking incident response time by 80% with Amazon OpenSearch Service observability

This is a guest post co-written with Shashidhar Soppin, Manochandra Menni, and Anchal Kansal from Zeta.

Zeta is a core banking technology provider that enables banks to rapidly launch extensible banking asset and liability products. Zeta’s primary products are Olympus and Tachyon. Olympus is a platform as a service (PaaS) that simplifies building and operating cloud-native, secure, and distributed multi-tenant software as a service (SaaS) products. It blends infrastructure as code and GitOps methodologies for efficient and consistent deployment of SaaS products. Its architecture prioritizes strong tenant isolation, real-time event processing, and comprehensive observability, supporting robust API integrations and seamless deployment.

Zeta’s Tachyon is a full-stack, cloud-native, API-first digital banking SaaS service delivered through Olympus. The banking services of Tachyon include payment engines (for UPI, credit, debit, and prepaid cards), savings and checking account management, and more. Tachyon is a modern debit processing product with personal finance management and card controls. It’s designed to increase usage, upsell credit, reduce fraud, and improve customer satisfaction. The Tachyon product offers comprehensive provisioning, payments, and account management APIs and SDKs, enabling seamless integration of financial products into third-party apps without compromising privacy and security. Zeta operates Tachyon as a multi-tenant SaaS product, serving customers who are configured as individual tenants within the system. Zeta’s technology stack is monitored by their Customer Service Navigator (CSN) product, which is part of Olympus.

As a global SaaS provider, Zeta needed a solution capable of monitoring tenants, measuring SLAs, meeting local regulatory requirements, and scaling efficiently with both new tenant onboarding and seasonal usage spikes. Zeta sought a cost-effective, scalable system that would provide a unified “single pane of glass” to monitor the application services, cloud infrastructure, open source components, and third-party products.

Zeta faced a formidable challenge in orchestrating a cohesive monitoring system across a rapidly expanding multi-tenant environment, diverse domains, and numerous tools. As more tenants joined the system, the complexity compounded, making Zeta’s monitoring solution increasingly difficult to maintain. The primary challenge stemmed from fragmented monitoring tools that made it difficult to quickly identify root causes across interconnected systems, leading to prolonged troubleshooting times and potential service degradation. When users reported issues, such as credit card payment problems, the Site Reliability Engineering (SRE) team had to navigate multiple disparate monitoring tools and siloed data, and the lack of integrated observability resulted in time-consuming manual correlation efforts. This multi-tenant, multi-solution landscape made it significantly harder to maintain consistent monitoring standards and service levels.

The problem was further complicated by the complex regulatory landscape, where global expansion required adherence to various local regulations, necessitating a flexible architecture capable of accommodating different data retention policies and access controls across jurisdictions. Each new tenant multiplied the complexity of balancing the monitoring needs of internal SRE teams and customers, requiring sophisticated data segregation and access management. Additionally, Zeta required comprehensive anomaly detection across systems, components, infrastructure, and operations, which called for a solution that could scale dynamically while establishing dynamic baselines and identifying subtle patterns that might indicate emerging issues. As the tenant base continued to grow, the need for a unified, scalable monitoring solution that could streamline these processes, improve operational visibility, and maintain system integrity became critical.

Zeta’s goal was to streamline their processes and improve operational visibility across the entire technology landscape. By addressing these challenges, Zeta aimed to create a unified observability solution that would significantly improve incident response times, strengthen their regulatory compliance posture, and ultimately deliver a more reliable and performant service to their global customer base.

In this post, we explain how Zeta built a more unified monitoring solution using Amazon OpenSearch Service that improved performance, reduced manual processes, and increased end-user satisfaction. Zeta has achieved over an 80% reduction in mean time to resolution (MTTR), with incident response times decreasing from more than 30 minutes to under 5 minutes.
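As a quick sanity check on the headline figure, the relative reduction can be computed from the two response times quoted above. This is illustrative arithmetic only: it assumes exactly 30 minutes before and exactly 5 minutes after, so the real reduction is at least this large.

```python
def percent_reduction(before_minutes: float, after_minutes: float) -> float:
    """Relative reduction from 'before' to 'after', as a percentage."""
    if before_minutes <= 0:
        raise ValueError("before_minutes must be positive")
    return (before_minutes - after_minutes) / before_minutes * 100.0


# Assumed boundary values taken from the figures quoted in the post.
mttr_reduction = percent_reduction(30, 5)
print(f"{mttr_reduction:.1f}%")  # 83.3%
```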

Solution overview

Zeta designed and built an observability system, CSN, to deliver comprehensive visibility across the service environment. CSN is part of the Olympus suite of products. It serves as the primary interface for the SRE team, offering real-time service health dashboards, infrastructure monitoring, SLA performance analytics, and an admin panel for user management. The system integrates with single sign-on (SSO) and enforces role-based access control (RBAC) to enable secure, granular access. With CSN, SREs can efficiently monitor system health, receive actionable alerts and warnings, and manage operational workflows across critical services.

CSN is powered by OpenSearch Service to provide an integrated solution that helps DevOps engineers and Site Reliability Engineers identify critical events and issues. Zeta chose OpenSearch Service because it offers a fully managed, open source search and analytics engine that scales to handle the increasing number of tenants, the associated data growth, and analytics needs. Its seamless integration with AWS services, robust security features, and support for real-time data ingestion and querying make it well suited to powering the CSN dashboards and analytics workloads. The following diagram illustrates the CSN deployment architecture.

Zeta CSN Deployment Architecture

The OpenSearch Service domain uses the Multi-AZ with Standby deployment model, following AWS best practices for high availability and fault tolerance. Nodes, including dedicated cluster manager nodes, data nodes, and UltraWarm nodes, are distributed evenly across three Availability Zones in the same AWS Region. Availability Zones 1 and 2 handle active indexing and search traffic, and Availability Zone 3 contains standby nodes that remain passive during normal operations. If an Availability Zone failure occurs, OpenSearch Service automatically promotes standby nodes to active status, maintaining cluster operations with minimal disruption and no need for data redistribution.

The OpenSearch cluster consists of three dedicated cluster manager nodes and a data node count that is a multiple of three, which maintains quorum and balanced shard allocation. Each index uses at least two replicas, providing redundant copies of data across the Availability Zones. This Multi-AZ with Standby configuration delivers high resilience and rapid failover, supporting continuous service availability and robust disaster recovery for the observability workloads.
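The sizing rules in this section can be written down as a small pre-deployment check. The following is a minimal sketch, not an AWS API: the `DomainTopology` type and `topology_problems` function are hypothetical names introduced here for illustration, and the rules are only the ones stated above.

```python
from dataclasses import dataclass

AVAILABILITY_ZONES = 3  # Multi-AZ with Standby spans three Availability Zones


@dataclass(frozen=True)
class DomainTopology:
    """Hypothetical description of a planned domain layout."""
    cluster_manager_nodes: int
    data_nodes: int
    replicas_per_index: int


def topology_problems(t: DomainTopology) -> list[str]:
    """Return a list of rule violations; an empty list means the layout passes."""
    problems = []
    if t.cluster_manager_nodes != 3:
        problems.append("expected three dedicated cluster manager nodes")
    if t.data_nodes <= 0 or t.data_nodes % AVAILABILITY_ZONES != 0:
        problems.append("data node count should be a positive multiple of three")
    if t.replicas_per_index < 2:
        problems.append("each index should have at least two replicas")
    return problems


print(topology_problems(DomainTopology(3, 6, 2)))  # passes every rule
print(topology_problems(DomainTopology(3, 4, 1)))  # fails two rules
```

A check like this could run in a deployment pipeline before the domain configuration is applied, so that an uneven node count is caught before it reaches production.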

Data collection and ingestion

The observability strategy centers on a data collection and ingestion pipeline designed to handle the complexity and scale of the platform. The architecture, as shown in the following diagram, addresses three critical data types: AWS resource logs, application logs, and distributed traces. Each data type uses tailored collection and processing methods optimized for its workload.

Zeta CSN Data Ingestion

AWS resource logs collection

The infrastructure spans multiple AWS services, including Amazon Elastic Kubernetes Service (Amazon EKS), Amazon Relational Database Service (Amazon RDS), Amazon Redshift, Application Load Balancer, Amazon Managed Streaming for Apache Kafka (Amazon MSK), Amazon Elastic Compute Cloud (Amazon EC2), and more. Zeta uses Amazon CloudWatch Logs as the primary collection point for AWS service logs because it provides native integration with these services.

AWS services deliver their logs directly to CloudWatch Logs, from which Fluentd running on the Amazon EKS cluster pulls them for centralized processing. This approach natively captures operational data from the AWS resources, including:

  • Database operational logs and audit trails from Amazon RDS instances
  • Data warehouse query execution logs from Amazon Redshift
  • Application Load Balancer access logs capturing traffic patterns and performance metrics
  • Kafka cluster operational logs from Amazon MSK
  • AWS API invocation audit trails from AWS CloudTrail
  • Container runtime and operating system logs from Amazon EC2

During log collection, personally identifiable information (PII) is filtered out. The solution adheres strictly to PCI DSS guidelines throughout this process.
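The post does not describe how the PII filter works. As one possible illustration, the following sketch masks card-number-like digit runs and email addresses before a log line is forwarded. The two regular expressions and the `redact` function are assumptions made for this example; they are not Zeta's rules and are not sufficient on their own for PCI DSS compliance.

```python
import re

# Hypothetical patterns for illustration only.
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_LIKE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(message: str) -> str:
    """Replace card-number-like digit runs and email addresses with placeholders."""
    message = CARD_LIKE.sub("[REDACTED_PAN]", message)
    return EMAIL.sub("[REDACTED_EMAIL]", message)


line = "payment failed for 4111 1111 1111 1111, contact jane@example.com"
print(redact(line))
```

A pattern this broad also masks other long digit runs, such as epoch-millisecond timestamps, so a real filter would need narrower rules or checksum validation.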

Zeta uses Amazon MSK as a scalable and reliable backbone for collecting and streaming logs from various sources across the AWS resources. Logs are ingested into Amazon MSK, which provides a durable, fault-tolerant buffer that decouples log producers from consumers. This architecture enables real-time log streaming and supports advanced processing pipelines before the logs are routed to OpenSearch Service. Integrating Amazon MSK into the logging workflow improves scalability, resilience, and flexibility, so high log volumes are managed efficiently without impacting downstream systems. This approach, combined with native AWS integrations, minimizes operational complexity and maintains comprehensive, centralized log visibility across the cloud environment.

Fluentd processes these logs and routes them directly to OpenSearch Service, maintaining the benefits of AWS integration while providing centralized accessibility. This centralized logging approach, with built-in buffering, reduces the direct load on OpenSearch Service by batching and optimizing log delivery, which helps prevent ingestion bottlenecks during high-volume periods. It also removes the need for custom log shipping agents on AWS resources, reducing operational overhead while maintaining comprehensive coverage of the cloud infrastructure.

Application logs processing

For application-level observability, a Fluentd pipeline is deployed as a Kubernetes DaemonSet. Application microservices running on Amazon EKS generate logs that the Fluentd DaemonSet pods collect, parse, and enrich with metadata such as pod names, namespaces, and service identifiers. The processed logs then flow through Amazon MSK for reliable, high-throughput message streaming before final processing by Fluentd and indexing in OpenSearch Service.

This Kafka-based approach provides several advantages:

  • Decoupling – Producers and consumers operate independently, so Zeta can scale ingestion and processing separately based on demand.
  • Backpressure handling – Kafka’s buffering absorbs sudden increases in log volume during peak banking hours and seasonal usage surges, keeping the system stable.
  • Durability of logs – Message persistence means that no log data is lost during system maintenance or unexpected failures.

The logs then pass through a second Fluentd layer for final processing and routing to OpenSearch Service, where they’re indexed across service-specific indexes (app-index, falco-index, kong-index).
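In the actual pipeline, Fluentd configuration performs the enrichment and routing steps. Purely to illustrate the logic, the following Python sketch attaches Kubernetes metadata to a record and picks a destination index. The index names come from this post; the field names, the source-to-index mapping, and the fallback to `app-index` are assumptions.

```python
# Index names are from the post; which source maps to which index is assumed.
INDEX_BY_SOURCE = {
    "app": "app-index",
    "falco": "falco-index",
    "kong": "kong-index",
}


def enrich(record: dict, pod: str, namespace: str, service: str) -> dict:
    """Return a copy of the record with (hypothetical) metadata fields attached."""
    return {
        **record,
        "kubernetes": {"pod_name": pod, "namespace": namespace},
        "service": service,
    }


def target_index(source: str) -> str:
    """Pick the service-specific index; unknown sources fall back to app-index."""
    return INDEX_BY_SOURCE.get(source, "app-index")


raw = {"message": "card authorization timed out", "level": "ERROR"}
doc = enrich(raw, pod="payments-7c9f", namespace="tachyon", service="payments")
print(target_index("kong"), doc["kubernetes"]["namespace"])
```

Because `enrich` returns a new dictionary, the original record stays untouched, which makes the step safe to retry if delivery to the next stage fails.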

Distributed trace collection

To address the challenge of correlating issues across Zeta’s microservices architecture, the system uses distributed tracing with Jaeger, an open source, end-to-end distributed tracing system. Jaeger enables monitoring and troubleshooting of transactions in complex distributed systems by tracking requests as they flow through multiple services. The application services and Kong API Gateway are instrumented with Jaeger client libraries that generate trace data, including spans, which represent individual operations within a trace. Each span contains metadata such as operation names, start and end timestamps, tags, and logs that provide context about the operation being performed. The Jaeger Collector aggregates these spans from multiple services, performing validation, indexing, and transformation before forwarding the data.
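To make the span concept concrete, the following sketch models a span with the metadata listed above and finds the slowest child operation under a given parent. It is a simplified illustration, not the Jaeger data model or client API; the `Span` class, its field names, and the sample trace are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Simplified span: one operation within a trace (not Jaeger's schema)."""
    trace_id: str
    span_id: str
    operation_name: str
    start_us: int  # start timestamp in microseconds
    end_us: int    # end timestamp in microseconds
    parent_span_id: Optional[str] = None
    tags: dict = field(default_factory=dict)

    @property
    def duration_us(self) -> int:
        return self.end_us - self.start_us


def slowest_child(spans, parent_span_id):
    """Return the longest-running direct child of the given span, or None."""
    children = [s for s in spans if s.parent_span_id == parent_span_id]
    return max(children, key=lambda s: s.duration_us) if children else None


# Made-up trace for a single payment request.
trace = [
    Span("t1", "a", "kong: POST /payments", 0, 120_000),
    Span("t1", "b", "payment-service: authorize", 5_000, 115_000, parent_span_id="a"),
    Span("t1", "c", "payment-service: db.query", 10_000, 100_000,
         parent_span_id="b", tags={"db.type": "sql"}),
    Span("t1", "d", "payment-service: publish-event", 101_000, 104_000,
         parent_span_id="b"),
]

culprit = slowest_child(trace, "b")
print(culprit.operation_name, culprit.duration_us)  # payment-service: db.query 90000
```

Following parent-child links in this way is what lets an engineer drill from a slow API call at the gateway down to the specific operation that consumed the time.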

The traces flow through Amazon MSK for the same reliability benefits as the logging pipeline: durability, decoupling, and backpressure handling during high-volume periods. Jaeger Ingester then consumes traces from Amazon MSK and processes them for storage in the jaeger-index within OpenSearch Service.

This data collection and ingestion strategy provides end-to-end visibility and builds an observability system that enables SRE teams to monitor, troubleshoot, and optimize the services across the entire technology stack.

Storage tiering

To manage the log, metric, and trace data at scale (about 3 TB generated daily), the solution uses OpenSearch Service storage tiering to balance performance, retention, and cost. Zeta requires near real-time search and retrieval for at least a week, while retaining logs and traces for up to 10 years. Keeping all of this data in active clusters would degrade search performance and significantly increase costs, so the solution uses the OpenSearch Service hot, UltraWarm, and cold storage tiers to optimize the data lifecycle. The following diagram illustrates storage tiering in OpenSearch Service.

Zeta CSN Storage Tiering

Hot storage is used for the most recent and frequently accessed data, supporting real-time indexing and low-latency queries. This tier relies on high-performance storage attached to standard data nodes, making it well suited to powering live dashboards and analytics where speed is critical. The solution runs the OpenSearch Service domain on AWS Graviton2-powered m6g.4xlarge.search instance types, which provide up to 40% lower cost compared to x86-based instances. Each hot data node has an attached gp3 EBS volume to store indexes. Zeta keeps data in hot storage for 1 week.

UltraWarm storage serves as a cost-effective layer for older, read-only data that is queried less frequently but still needs to remain searchable. UltraWarm nodes use Amazon Simple Storage Service (Amazon S3) as the backing store with a built-in caching mechanism, retaining large volumes of data at a fraction of the cost of hot storage while still supporting interactive queries for historical analysis. Zeta uses ultrawarm1.large.search instance types in the UltraWarm storage tier and keeps data in UltraWarm storage for 15 days.
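A rough, back-of-the-envelope estimate shows what these retention periods imply for each tier. The sketch below uses the figures quoted in this post (3 TB per day, 7 days hot, 15 days UltraWarm, two replicas per index) and makes simplifying assumptions: that the 3 TB is the on-disk size of primary shards, that index overhead and compression are ignored, and that UltraWarm keeps no replica copies because the data is backed by Amazon S3.

```python
DAILY_INGEST_TB = 3   # "about 3 TB generated daily"
HOT_DAYS = 7          # hot retention
ULTRAWARM_DAYS = 15   # UltraWarm retention
REPLICAS = 2          # "at least two replicas" per index


def hot_footprint_tb(daily_tb: float, days: int, replicas: int) -> float:
    """Each replica is a full copy, so hot storage scales with (1 + replicas)."""
    return daily_tb * days * (1 + replicas)


def ultrawarm_footprint_tb(daily_tb: float, days: int) -> float:
    """UltraWarm data is backed by Amazon S3, so no replica copies are counted."""
    return daily_tb * days


print(hot_footprint_tb(DAILY_INGEST_TB, HOT_DAYS, REPLICAS))     # 63
print(ultrawarm_footprint_tb(DAILY_INGEST_TB, ULTRAWARM_DAYS))   # 45
```

Under these assumptions, one week of hot data occupies more storage than 15 days of UltraWarm data, which illustrates why moving indexes out of the hot tier promptly matters for cost.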

Cold storage is designed for long-term archival of infrequently accessed or compliance-driven data. Data in cold storage is detached from active compute resources and resides in Amazon S3, incurring minimal cost. When historical data needs to be queried, the indexes are attached to the UltraWarm nodes using OpenSearch API calls. This supports extracting historical data for audits, periodic analysis, or forensic investigations without maintaining active compute for the entire retention period, which reduces cost.

OpenSearch Service automates index transitions between the hot, UltraWarm, and cold storage tiers using Index State Management (ISM) policies. ISM policies specify the conditions and actions for each state, such as transitioning based on index age, size, or document count. ISM jobs, which run every 5 to 8 minutes, evaluate the policy and move an index to the next tier when it qualifies for a transition. When indexes reach the UltraWarm threshold, they’re migrated to UltraWarm nodes backed by Amazon S3, which reduces storage costs while keeping data accessible for queries. After the UltraWarm retention period, ISM archives the indexes to cold storage, detaching them from compute resources but allowing reattachment for future queries or compliance needs. This automated lifecycle management reduces operational overhead, optimizes storage costs, and maintains performance for both recent and historical data.

For observability data, new indexes are created in the hot tier, where they remain for 7 days to support fast ingestion and low-latency queries. After this period, ISM transitions these indexes to UltraWarm storage, where they’re retained for an additional 15 days as read-only data, balancing cost with searchability.
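The lifecycle described above could be expressed as an ISM policy along the following lines. This is a minimal sketch rather than Zeta's actual policy: the index pattern `app-index-*` and the `@timestamp` field are assumptions. Because `min_index_age` is measured from index creation, the cold transition is set to 7 + 15 = 22 days.

```python
import json

HOT_DAYS = 7
ULTRAWARM_DAYS = 15

ism_policy = {
    "policy": {
        "description": "Illustrative hot -> UltraWarm -> cold lifecycle",
        "default_state": "hot",
        "states": [
            {
                "name": "hot",
                "actions": [],
                "transitions": [
                    {"state_name": "warm",
                     "conditions": {"min_index_age": f"{HOT_DAYS}d"}}
                ],
            },
            {
                "name": "warm",
                # warm_migration moves the index to UltraWarm storage.
                "actions": [{"warm_migration": {}}],
                "transitions": [
                    {"state_name": "cold",
                     "conditions": {"min_index_age": f"{HOT_DAYS + ULTRAWARM_DAYS}d"}}
                ],
            },
            {
                "name": "cold",
                # The timestamp field name is an assumption for this sketch.
                "actions": [{"cold_migration": {"timestamp_field": "@timestamp"}}],
                "transitions": [],
            },
        ],
        # Hypothetical pattern; the post does not give the index naming scheme.
        "ism_template": {"index_patterns": ["app-index-*"], "priority": 100},
    }
}

print(json.dumps(ism_policy, indent=2))
```

The printed JSON is the request body for creating a policy through the ISM API. A production policy would typically also add retries and timeouts to the migration actions.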

Security

Security is the most critical part of the architecture. Zeta’s observability system implements multiple layers of security for data confidentiality, integrity, and compliance with banking regulations. It is built using a zero-trust approach and follows the AWS shared responsibility model for OpenSearch Service:

  • Infrastructure security: The OpenSearch Service domain is deployed inside a virtual private cloud (VPC) with private subnets, isolating it from direct internet access. Security groups enforce restrictive ingress rules, allowing access only from authorized sources. The OpenSearch Service domain uses encryption at rest through AWS Key Management Service (AWS KMS). Data in transit is secured using TLS 1.3 encryption, so log data, traces, and search queries remain protected during transmission. Service-to-service communication uses AWS Identity and Access Management (IAM) roles and encrypted connections, removing the need for hardcoded credentials.
  • Access control and authentication: The solution uses Amazon OpenSearch Service fine-grained access control (FGAC) integrated with IAM, where IAM serves as the authentication provider and FGAC handles authorization by mapping IAM roles to OpenSearch backend roles. This approach helps Zeta control access permissions at the index and document level based on tenant requirements and user responsibilities. The data ingestion pipeline implements end-to-end security, with Fluentd authenticating to Amazon MSK using IAM roles over encrypted connections. Amazon MSK clusters use encryption in transit and at rest, protecting log data throughout the streaming pipeline. Kubernetes RBAC policies restrict pod-to-pod communication and limit service account permissions.
  • Data privacy and tenant isolation: Each tenant’s data is logically separated in OpenSearch Service using a tenant ID. CSN implements tenant-aware authentication and authorization with FGAC, limiting users to their authorized tenants’ dashboards and data. Every API endpoint validates tenant context, so users can only access data within their authorized scope. Importantly, no customer data is captured in the logs; only system metrics are used to build the monitoring system, adhering to banking security standards and best practices. User actions are audited and logged for compliance purposes, with audit trails maintained according to regulatory requirements.
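Document-level tenant isolation of the kind described in the list above can be expressed as a fine-grained access control role that carries a document-level security (DLS) query. The following is a minimal sketch under stated assumptions: the `tenant_id` field name, the index pattern, and the read-only action group are illustrative choices, not Zeta's configuration. The dictionary follows the role body format of the OpenSearch Security plugin, in which the DLS query is stored as a JSON string.

```python
import json


def tenant_role(tenant_id: str, index_patterns: list[str]) -> dict:
    """Build a read-only role limited to documents that belong to one tenant."""
    # Hypothetical field name; documents are assumed to carry a tenant_id keyword.
    dls_query = {"term": {"tenant_id": tenant_id}}
    return {
        "cluster_permissions": [],
        "index_permissions": [
            {
                "index_patterns": index_patterns,
                "dls": json.dumps(dls_query),  # DLS is stored as a JSON string
                "allowed_actions": ["read"],
            }
        ],
    }


role = tenant_role("bank-a", ["app-index*"])
print(json.dumps(role, indent=2))
```

A role like this would then be mapped to the backend role that corresponds to the tenant's IAM role, so the filter is enforced by the cluster no matter which dashboard or API issues the query.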

This security framework enables the observability system to meet the security requirements of core banking operations while maintaining operational efficiency and regulatory compliance across global industries.

Customer Service Navigator

CSN gives SREs a powerful diagnostics interface engineered for high-efficiency monitoring, deep analysis, and rapid troubleshooting of system performance across distributed environments. The system ingests and processes telemetry data at sub-minute intervals, providing near real-time metrics, traces, and logs from critical infrastructure components. Actionable, interactive visualizations, such as heatmaps, anomaly graphs, and dependency maps, help SREs quickly detect SLO breaches and drill down to granular root causes, often within a few minutes of an incident.

The following screenshot shows an example service health dashboard in CSN for an Olympus tenant.

Zeta CSN Service Health Dashboard

The following screenshot shows an example of the API performance insights dashboard in CSN.

Zeta CSN API Performance Dashboard

Business and technical benefits

The OpenSearch Service-based CSN system provides the following business and technical benefits:

  • Automated Index State Management (ISM) and lifecycle policies reduce manual effort, so Zeta’s teams can focus on innovation
  • Automated lifecycle policies handle retention and archiving of compliance data, reducing the risk of non-compliance
  • The system supports log retention for over 10 years to meet regulatory requirements for Zeta’s banking and financial services customers
  • Multiple layers of security, including encryption at rest and in transit, FGAC, and tenant isolation, protect customer data and support Zeta’s zero-trust architecture
  • By consolidating logs, traces, and metrics from disparate systems into OpenSearch, SRE teams can correlate events more effectively, reducing troubleshooting effort and achieving an 80% improvement in MTTR
  • Archived logs are stored in Amazon S3, which is designed for 99.999999999% (11 nines) data durability, providing long-term data integrity
  • Zstandard compression is being implemented to optimize long-term storage costs

Conclusion

CSN’s advanced correlation engine automatically associates related events across microservices, databases, network layers, and infrastructure, significantly streamlining root cause analysis. Integrated alerting and automated runbooks further reduce response times. Since implementing CSN, Zeta has achieved over an 80% reduction in MTTR, with incident response times decreasing from more than 30 minutes to under 5 minutes. The service supports seamless multi-tenant monitoring, processes 3 TB of machine-generated data daily, and is architected for petabyte-scale growth. Additionally, CSN helps Zeta meet regulatory requirements for retaining historical logs over multiple years while keeping storage costs under control. This has significantly improved operational resilience, increased service availability, and empowered teams to proactively resolve issues before they affect end users.

Ready to take your organization’s observability capabilities to the next level? Dive into the technical details of OpenSearch Service in the Amazon OpenSearch Service Developer Guide. Visit our new migration hub page for more prescriptive guidance on moving your workloads to OpenSearch Service.


About the authors

Deepesh Dhapola is a Senior Solutions Architect at AWS India, where he architects high-performance, resilient cloud solutions for financial services and fintech organizations. He specializes in using advanced AI technologies, including generative AI, intelligent agents, and the Model Context Protocol (MCP), to design secure, scalable, and context-aware applications. With deep expertise in machine learning and a keen focus on emerging trends, Deepesh drives digital transformation by integrating cutting-edge AI capabilities to improve operational efficiency and foster innovation for AWS customers. Beyond his technical pursuits, he enjoys quality time with his family and explores creative culinary techniques.

Shashidhar (Shashi) Soppin is an accomplished Enterprise Architect and cloud transformation leader with more than 24 years of experience spanning regulated industries and high-growth technology environments. Currently steering strategic initiatives as Lead Architect at Zeta’s CTO office, Shashidhar has helped build and lead world-class engineering teams, driving innovation in cloud, security, and fintech domains. He has architected secure, scalable platforms, scaling user bases by 10x, enabling complex integrations for a major bank’s migration to Zeta’s platforms, and pioneering Zero Trust frameworks that achieved excellent regulatory compliance. A results-driven executive and former DMTS at Wipro, Shashidhar holds more than 25 granted patents and has delivered multi-million dollar business deals across domains including AI/ML. Known as a published author (“Essentials of Deep Learning”), frequent industry speaker, and hands-on innovator, he combines technical expertise with business acumen, propelling organizations toward robust, future-ready cloud ecosystems and operational excellence. Prior to Wipro, he worked at IBM-ISL.

Anchal Kansal is a Lead Site Reliability Engineer at Zeta, where she has spent the past 4 years building and scaling reliable, high-performance systems. With deep expertise in OpenSearch, observability platforms, and large-scale infrastructure, she focuses on ensuring uptime, performance, and operational efficiency. Anchal is passionate about solving complex reliability challenges and sharing practical insights with the engineering community.

Manochandra (Mano) is the Site Reliability Engineering (SRE) expert at Zeta, specializing in data management-oriented systems. With a deep understanding of large-scale distributed architectures, he has extensive experience designing, deploying, and maintaining resilient, production-grade OpenSearch systems. Mano is known for his proactive approach to optimizing infrastructure reliability and performance, as well as his ability to troubleshoot complex operational challenges. His expertise spans implementing automation, monitoring, and incident management best practices, making him a go-to resource for ensuring service availability and scalability at Zeta.

Hitesh Subnani is an FSI Solutions Architect at AWS India, where he works with customers to design and build architectures that deliver business value. He specializes in comprehensive observability and analytics systems, enabling organizations to gain deep insights from operational data. With expertise in search and analytics technologies, Hitesh focuses on scalable monitoring systems, real-time dashboards, and compliance-driven architectures for AWS customers in the financial sector.

Tarun Chakraborty is a Sr. Technical Account Manager (TAM) at AWS India, where he partners with leading banks and fintech organizations to accelerate their cloud transformation journeys. With over 15 years of experience in technology and financial services, he serves as a trusted advisor helping customers use AWS’s comprehensive suite of services to drive innovation and achieve their business objectives.
