This post is co-written with Ido Ziv from Kaltura.
As organizations grow, managing observability across multiple teams and applications becomes increasingly complex. Logs, metrics, and traces generate vast amounts of data, making it challenging to maintain performance, reliability, and cost-efficiency.
At Kaltura, an AI-infused, video-first company serving millions of users across hundreds of applications, observability is mission-critical. Understanding system behavior at scale isn't just about troubleshooting; it's about providing seamless experiences for customers and employees alike. But achieving effective observability at this scale comes with challenges: managing spans; correlating logs, traces, and events across distributed systems; and maintaining visibility without overwhelming teams with noise. Balancing granularity, cost, and actionable insights requires constant tuning and thoughtful architecture.
In this post, we share how Kaltura transformed its observability strategy and technology stack by migrating from a software as a service (SaaS) logging solution to Amazon OpenSearch Service, achieving longer log retention, a 60% reduction in cost, and a centralized platform that empowers multiple teams with real-time insights.
Observability challenges at scale
Kaltura ingests over 8 TB of logs and traces every day, processing more than 20 billion events across 6 production AWS Regions and over 200 applications, with log spikes reaching up to 6 GB per second. This immense data volume, combined with a highly distributed architecture, created significant observability challenges. Historically, Kaltura relied on a SaaS-based observability solution that met initial requirements but became increasingly difficult to scale. As the platform evolved, teams generated disparate log formats, applied retention policies that no longer reflected data value, and operated more than 10 organically grown observability sources. The lack of standardization and visibility required extensive manual effort to correlate data, maintain pipelines, and troubleshoot issues, leading to growing operational complexity and fixed costs that didn't scale efficiently with usage.
Kaltura's DevOps team recognized the need to reassess their observability solution and began exploring a variety of options, from self-managed platforms to fully managed SaaS offerings. After a comprehensive evaluation, they made the strategic decision to migrate to OpenSearch Service, using its advanced features such as Amazon OpenSearch Ingestion, the Observability plugin, UltraWarm storage, and Index State Management (ISM).
Solution overview
Kaltura created a new AWS account dedicated to observability, where OpenSearch Service was deployed. Logs and traces were collected from different accounts and producers, such as microservices on Amazon Elastic Kubernetes Service (Amazon EKS) and services running on Amazon Elastic Compute Cloud (Amazon EC2).
By using AWS services such as AWS Identity and Access Management (IAM), AWS Key Management Service (AWS KMS), and Amazon CloudWatch, Kaltura was able to meet the standards for a production-grade system while keeping security and reliability in mind. The following figure shows a high-level design of the environment setup.
Ingestion
As seen in the following diagram, logs are shipped using log shippers, also known as collectors; in Kaltura's case, Fluent Bit. A log shipper is a tool designed to collect, process, and transport log data from various sources to a centralized location, such as a log analytics platform, a management system, or an aggregator. Fluent Bit was used on all sources and also provided light processing capabilities. It was deployed as a DaemonSet in Kubernetes, so the application development teams didn't have to change their code, because the Fluent Bit pods read the stdout of the application pods.
The following code is an example of a Fluent Bit configuration for Amazon EKS:
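A minimal DaemonSet configuration along these lines might look like the following sketch. The file paths, tag names, and the OpenSearch Ingestion endpoint are illustrative placeholders, not Kaltura's actual values; the `http` output with SigV4 signing (`aws_auth`) is how Fluent Bit typically ships to an OpenSearch Ingestion pipeline:

```
[SERVICE]
    Parsers_File  parsers.conf

[INPUT]
    # Tail container logs written by the kubelet on each node
    Name          tail
    Path          /var/log/containers/*.log
    Parser        cri
    Tag           kube.*
    Mem_Buf_Limit 50MB

[FILTER]
    # Enrich records with Kubernetes metadata (namespace, pod, labels)
    Name          kubernetes
    Match         kube.*
    Merge_Log     On

[OUTPUT]
    # Ship to an OpenSearch Ingestion pipeline endpoint (placeholder host)
    Name          http
    Match         kube.*
    Host          logs-pipeline-example.us-east-1.osis.amazonaws.com
    Port          443
    URI           /log/ingest
    Format        json
    tls           On
    aws_auth      On
    aws_region    us-east-1
    aws_service   osis
```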
Spans and traces were collected directly from the application layer using a seamless integration approach. To facilitate this, Kaltura deployed an OpenTelemetry (OTel) Collector using the OpenTelemetry Operator for Kubernetes. Additionally, the team developed a custom OTel code library, which was incorporated into the application code to efficiently capture and log traces and spans, providing comprehensive observability across their system.
Data from Fluent Bit and the OpenTelemetry Collector was sent to OpenSearch Ingestion, a fully managed, serverless data collector that delivers real-time log, metric, and trace data to OpenSearch Service domains and Amazon OpenSearch Serverless collections. Each producer sent data to a specific pipeline, one for logs and one for traces, where data was transformed, aggregated, enriched, and normalized before being sent to OpenSearch Service. The trace pipeline used the otel_trace and service_map processors, following the OpenSearch Ingestion OpenTelemetry trace analytics blueprint.
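A trace pipeline based on that blueprint typically splits into an entry pipeline that fans out to a raw-trace pipeline and a service-map pipeline. The following is a condensed sketch under that assumption; the domain endpoint and role ARN are placeholders, not Kaltura's configuration:

```yaml
version: "2"
entry-pipeline:
  source:
    otel_trace_source:
      path: "/v1/traces"
  sink:
    - pipeline:
        name: "raw-traces"
    - pipeline:
        name: "service-map"
raw-traces:
  source:
    pipeline:
      name: "entry-pipeline"
  processor:
    # Flattens OTLP spans into trace-analytics documents
    - otel_traces:
  sink:
    - opensearch:
        hosts: ["https://search-observability-example.us-east-1.es.amazonaws.com"]
        index_type: "trace-analytics-raw"
        aws:
          sts_role_arn: "arn:aws:iam::123456789012:role/osis-ingestion-role"
          region: "us-east-1"
service-map:
  source:
    pipeline:
      name: "entry-pipeline"
  processor:
    # Derives service-to-service edges for the service map visualization
    - service_map:
  sink:
    - opensearch:
        hosts: ["https://search-observability-example.us-east-1.es.amazonaws.com"]
        index_type: "trace-analytics-service-map"
        aws:
          sts_role_arn: "arn:aws:iam::123456789012:role/osis-ingestion-role"
          region: "us-east-1"
```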
The following code is an example of an OpenSearch Ingestion pipeline for logs:
The preceding example shows the use of processors such as `grok`, `date`, `add_entries`, `rename_keys`, and `drop_events`:
- `add_entries`:
  - Adds a new field `log_type` based on the filename
  - Default: `"default"`
  - If the filename contains specific substrings (such as `api.log` or `stats.log`), it assigns a more specific type
- `grok`:
  - Applies Grok parsing to logs of type `"api"`
  - Extracts fields like `timestamp`, `logIp`, `host`, `priorityName`, `priority`, `memory`, `real`, and `message` using a custom pattern
- `date`:
  - Parses timestamp strings into a standard datetime format
  - Stores the result in a field called `@timestamp`, based on ISO8601 format
  - Handles multiple timestamp patterns
- `rename_keys`:
  - Renames `timestamp` or `date` to `@timestamp`
  - Doesn't overwrite `@timestamp` if it already exists
- `drop_events`:
  - Drops logs whose filename contains `simplesamlphp.log`
  - This is a filtering rule to ignore noisy or irrelevant logs
The following is an example of an input log line:
After processing, we get the following output:
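As a purely hypothetical illustration of that transformation (the field names follow the processor descriptions above, but the values are invented):

```
# Input line (from a file matching api.log):
2024-05-12T08:31:02+00:00 10.0.3.17 api-7f9c4 NOTICE 5 42.5 40.1 request completed

# Output document after the pipeline runs:
{
  "@timestamp": "2024-05-12T08:31:02.000Z",
  "log_type": "api",
  "logIp": "10.0.3.17",
  "host": "api-7f9c4",
  "priorityName": "NOTICE",
  "priority": "5",
  "memory": "42.5",
  "real": "40.1",
  "message": "request completed"
}
```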
Kaltura adopted some OpenSearch Ingestion best practices, such as:
- Including a dead-letter queue (DLQ) in the pipeline configuration, which can significantly help in troubleshooting pipeline issues.
- Starting and stopping pipelines to optimize cost-efficiency, when possible.
- During the proof of concept stage:
  - Installing Data Prepper locally for faster development iterations.
  - Disabling persistent buffering to expedite blue-green deployments.
Achieving operational excellence with efficient log and trace management
Logs and traces play a crucial role in identifying operational issues, but they come with unique challenges. First, they represent time series data, which inherently evolves over time. Second, their value typically diminishes as time passes, making efficient management essential. Third, they're append-only in nature. With OpenSearch, Kaltura faced distinct trade-offs between cost, data retention, and latency. The goal was to make sure valuable data remained accessible to engineering teams with minimal latency, but the solution also needed to be cost-effective. Balancing these factors required thoughtful planning and optimization.
Data was ingested into OpenSearch data streams, which simplify the process of ingesting append-only time series data. Several Index State Management (ISM) policies were applied to different data streams, depending on log retention requirements. The ISM policies handled moving indexes from hot storage to UltraWarm, and eventually deleting the indexes. This allowed a customizable and cost-effective solution, with low latency for querying new data and reasonable latency for querying historical data.
The following example ISM policy makes sure indexes are managed efficiently, rolled over, and moved to different storage tiers based on their age and size, and eventually deleted after 60 days. If an action fails, it's retried with an exponential backoff strategy. In case of failures, notifications are sent to the relevant teams to keep them informed.
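A policy along those lines might look like the following sketch. The rollover thresholds, transition ages, and notification webhook are illustrative; only the 60-day deletion matches the description above:

```json
{
  "policy": {
    "description": "Roll over, move to UltraWarm, delete after 60 days",
    "default_state": "hot",
    "error_notification": {
      "destination": {
        "slack": { "url": "https://hooks.slack.com/services/REPLACE_ME" }
      },
      "message_template": {
        "source": "ISM action failed on index {{ctx.index}}"
      }
    },
    "states": [
      {
        "name": "hot",
        "actions": [
          {
            "retry": { "count": 3, "backoff": "exponential", "delay": "10m" },
            "rollover": { "min_size": "50gb", "min_index_age": "1d" }
          }
        ],
        "transitions": [
          { "state_name": "warm", "conditions": { "min_index_age": "7d" } }
        ]
      },
      {
        "name": "warm",
        "actions": [
          {
            "retry": { "count": 3, "backoff": "exponential", "delay": "10m" },
            "warm_migration": {}
          }
        ],
        "transitions": [
          { "state_name": "delete", "conditions": { "min_index_age": "60d" } }
        ]
      },
      {
        "name": "delete",
        "actions": [ { "delete": {} } ],
        "transitions": []
      }
    ],
    "ism_template": [
      { "index_patterns": ["logs-*"], "priority": 100 }
    ]
  }
}
```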
To create a data stream in OpenSearch, an index template definition is required, which configures how the data stream and its backing indexes will behave. In the following example, the index template specifies key index settings such as the number of shards, replication, and refresh interval, controlling how data is distributed, replicated, and refreshed across the cluster. It also defines the mappings, which describe the structure of the data: what fields exist, their types, and how they should be indexed. These mappings make sure the data stream knows how to interpret and store incoming log data efficiently. Finally, the template enables the `@timestamp` field as the time-based field required for a data stream.
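A template of that shape might look like the following sketch; the template name, shard counts, refresh interval, and field mappings are illustrative values, not Kaltura's settings:

```json
PUT _index_template/logs-template
{
  "index_patterns": ["logs-*"],
  "data_stream": {
    "timestamp_field": { "name": "@timestamp" }
  },
  "priority": 100,
  "template": {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1,
      "refresh_interval": "30s"
    },
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" },
        "log_type":   { "type": "keyword" },
        "host":       { "type": "keyword" },
        "message":    { "type": "text" }
      }
    }
  }
}
```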
Implementing role-based access control and user access
The new observability platform is accessed by many types of users; internal users log in to OpenSearch Dashboards using SAML-based federation with Okta. The following diagram illustrates the user flow.
Each user accesses the dashboards to view the observability items relevant to their role. Fine-grained access control (FGAC) is enforced in OpenSearch using built-in IAM role and SAML group mappings to implement role-based access control (RBAC). When users log in to the OpenSearch domain, they're automatically routed to the appropriate tenant based on their assigned role. This setup makes sure developers can create dashboards tailored to debugging in development environments, and support teams can build dashboards focused on identifying and troubleshooting production issues. The SAML integration alleviates the need to manage internal OpenSearch users entirely.
For each role in Kaltura, a corresponding OpenSearch role was created with only the necessary permissions. For instance, support engineers are granted access to the monitoring plugin to create alerts based on logs, whereas QA engineers, who don't require this functionality, aren't granted that access.
The following screenshot shows the DevOps engineer role defined with cluster permissions.
These users are routed to their own dedicated DevOps tenant, to which only they have write access. This makes it possible for users in different roles at Kaltura to create the dashboard items that focus on their own priorities and needs. OpenSearch supports backend role mapping; Kaltura mapped each Okta group to the corresponding role, so when a user logs in from Okta, they're automatically assigned based on their role.
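With the OpenSearch Security REST API, such a backend role mapping can be expressed as follows; the role name and Okta group name here are hypothetical:

```json
PUT _plugins/_security/api/rolesmapping/devops_role
{
  "backend_roles": ["okta-devops-group"],
  "hosts": [],
  "users": []
}
```

Any SAML assertion carrying the `okta-devops-group` backend role would then be mapped to `devops_role` on login.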
This also works with IAM roles to facilitate automation in the cluster using external services, such as OpenSearch Ingestion pipelines, as can be seen in the following screenshot.
Using observability features and service mapping for enhanced trace and log correlation
After a user is logged in, they can use the Observability plugins, view surrounding events in logs, correlate logs and traces, and use the Trace Analytics plugin. Users can examine traces and spans, and group traces with latency information using built-in dashboards. Users can also drill down to a specific trace or span and correlate it back to log events. The `service_map` processor used in OpenSearch Ingestion sends OpenTelemetry data to create a distributed service map for visualization in OpenSearch Dashboards.
Using the combined signals of traces and spans, OpenSearch discovers the application connectivity and maps it into a service map.
After OpenSearch ingests the traces and spans from the OTel collector, they're aggregated into groups according to paths and characteristics. Durations are also calculated and presented to the user over time.
With a trace ID, it's possible to filter all the related spans by service and see how long each took, identifying issues with external services such as MongoDB and Redis.
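Beyond the Trace Analytics UI, the same lookup can be done with a PPL query against the raw span index written by the trace pipeline. The index pattern follows the Data Prepper naming convention, and the trace ID below is a placeholder:

```
source = otel-v1-apm-span-*
| where traceId = "0af7651916cd43dd8448eb211c80319c"
| fields spanId, parentSpanId, serviceName, name, durationInNanos
| sort - durationInNanos
```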
From the spans, users can discover the related logs.
Post-migration enhancements
After the migration, a strong developer community emerged within Kaltura that embraced the new observability solution. As adoption grew, so did requests for new features and enhancements aimed at improving the overall developer experience.
One key improvement was extending log retention. Kaltura achieved this by re-ingesting historical logs from Amazon Simple Storage Service (Amazon S3) using a dedicated OpenSearch Ingestion pipeline with Amazon S3 read permissions. With this enhancement, teams can access and analyze logs from up to a year ago using the same familiar dashboards and filters.
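Such a replay pipeline can be sketched with the Data Prepper `s3` source in scan mode; the bucket name, role ARNs, and index are placeholders, and the codec and compression depend on how the archived logs were written:

```yaml
version: "2"
s3-replay-pipeline:
  source:
    s3:
      # One-time scan of archived objects rather than SQS-driven ingestion
      scan:
        buckets:
          - bucket:
              name: "archived-logs-bucket"
      codec:
        newline:
      compression: "gzip"
      aws:
        region: "us-east-1"
        sts_role_arn: "arn:aws:iam::123456789012:role/osis-s3-read-role"
  processor:
    # Archived lines are assumed to be JSON documents
    - parse_json:
  sink:
    - opensearch:
        hosts: ["https://search-observability-example.us-east-1.es.amazonaws.com"]
        index: "logs-replay"
        aws:
          sts_role_arn: "arn:aws:iam::123456789012:role/osis-ingestion-role"
          region: "us-east-1"
```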
In addition to monitoring EKS clusters and EC2 instances, Kaltura expanded its observability stack by integrating additional AWS services. Amazon API Gateway and AWS Lambda were introduced to support log ingestion from external vendors, allowing for seamless correlation with existing data and broader visibility across systems.
Finally, to empower teams and promote autonomy, data stream templates and ISM policies are managed directly by developers within their own repositories. By using infrastructure as code (IaC) tools like Terraform, developers can define index mappings, alerts, and dashboards as code, versioned in Git and deployed consistently across environments.
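One way to sketch that workflow is with the community `opensearch-project/opensearch` Terraform provider; the resource names exist in that provider, but the endpoint, template names, and file paths here are hypothetical:

```hcl
terraform {
  required_providers {
    opensearch = {
      source = "opensearch-project/opensearch"
    }
  }
}

provider "opensearch" {
  url = "https://search-observability-example.us-east-1.es.amazonaws.com"
}

# Index template for the team's data stream, versioned in the team repo
resource "opensearch_composable_index_template" "app_logs" {
  name = "team-app-logs"
  body = file("${path.module}/templates/app-logs.json")
}

# ISM policy controlling the hot -> UltraWarm -> delete lifecycle
resource "opensearch_ism_policy" "app_logs" {
  policy_id = "team-app-logs-lifecycle"
  body      = file("${path.module}/policies/app-logs-ism.json")
}
```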
Conclusion
Kaltura successfully implemented a practical log retention strategy, extending real-time retention from 5 days for all log types to 30 days for critical logs, while maintaining cost-efficiency through the use of UltraWarm nodes. This approach led to a 60% reduction in costs compared to their previous solution. Additionally, Kaltura consolidated their observability platform, streamlining operations by merging 10 separate systems into a unified, all-in-one solution. This consolidation not only improved operational efficiency but also sparked increased engagement from developer teams, driving feature requests, fostering internal design collaborations, and attracting early adopters for new enhancements. If Kaltura's journey has inspired you and you're interested in implementing a similar solution in your organization, consider these steps:
- Start by understanding the requirements and setting expectations with the engineering teams in your organization
- Start with a quick proof of concept to get hands-on experience
- Refer to the following resources to help you get started:
About the authors
Ido Ziv is a DevOps team leader at Kaltura with over 6 years of experience. His hobbies include sailing and Kubernetes (but not at the same time).
Roi Gamliel is a Senior Solutions Architect helping startups build on AWS. He is passionate about the OpenSearch Project, helping customers fine-tune their workloads and maximize results.
Yonatan Dolan is a Principal Analytics Specialist at Amazon Web Services. He is located in Israel and helps customers harness AWS analytical services to use data, gain insights, and derive value.