As a global software-as-a-service (SaaS) company specializing in intuitive, AI-powered business solutions designed to enhance customer and employee experiences, Freshworks depends on real-time data to power decision-making and deliver better experiences to its 75,000+ customers. With millions of daily events across products, timely data processing is critical. To meet this need, Freshworks has built a near-real-time ingestion pipeline on Databricks, capable of managing diverse schemas across products and handling millions of events per minute within a 30-minute SLA, while ensuring tenant-level data isolation in a multi-tenant setup.
Achieving this requires a robust, flexible, and optimized data pipeline, which is exactly what we set out to build.
Legacy Architecture and the Case for Change
Freshworks’ legacy pipeline was built around Python consumers: each user action triggered events sent in real time from the products to Kafka, and the Python consumers transformed and routed those events to new Kafka topics. A Rails batching system then converted the transformed data into CSV files stored in AWS S3, and Apache Airflow jobs loaded these batches into the data warehouse. After ingestion, intermediate files were deleted to manage storage. This architecture was well suited to early growth but quickly hit its limits as event volume surged.
Rapid growth exposed core challenges:
- Scalability: The pipeline struggled to handle millions of messages per minute, especially during spikes, and required frequent manual scaling.
- Operational Complexity: The multi-stage flow made schema changes and maintenance risky and time-consuming, often resulting in mismatches and failures.
- Cost Inefficiency: Storage and compute expenses grew quickly, driven by redundant processing and a lack of optimization.
- Responsiveness: The legacy setup could not meet demands for real-time ingestion or fast, reliable analytics as Freshworks scaled. Prolonged ingestion delays hurt data freshness and impacted customer insights.
As scale and complexity increased, the fragility and overhead of the old system made clear the need for a unified, scalable, and autonomous data architecture to support the business’s growth and analytics needs.
New Architecture: Real-Time Data Processing with Apache Spark and Delta Lake
The solution: a foundational redesign centered on Spark Structured Streaming and Delta Lake, purpose-built for near-real-time processing, scalable transformations, and operational simplicity.
We designed a single, streamlined architecture in which Spark Structured Streaming directly consumes from Kafka, transforms the data, and writes it into Delta Lake, all in one job running entirely within Databricks.
This shift has reduced data movement, simplified maintenance and troubleshooting, and accelerated time-to-insight.
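To make the single-job pattern concrete, here is a minimal PySpark sketch under assumed names: the broker address, topic, target table, and checkpoint path are placeholders, and the real pipeline’s transformation and per-table routing logic is far richer than the stub shown.

```python
# Minimal sketch: one Structured Streaming job that reads from Kafka,
# transforms each micro-batch, and writes it to Delta (names are placeholders).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ingestion-sketch").getOrCreate()

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder brokers
    .option("subscribe", "product_events")              # placeholder topic
    .option("startingOffsets", "latest")
    .load()
)

events = raw.select(F.col("value").cast("string").alias("payload"), "timestamp")

def process_batch(batch_df, batch_id):
    # In production this is where validation, JSON-e templating, flattening,
    # and per-table MERGEs happen; here we simply append the raw payloads.
    batch_df.write.format("delta").mode("append").saveAsTable("bronze.product_events")

query = (
    events.writeStream
    .foreachBatch(process_batch)
    .option("checkpointLocation", "/checkpoints/product_events")  # placeholder path
    .start()
)
```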
The key components of the new architecture:
The Streaming Component: Spark Structured Streaming
Every incoming event from Kafka passes through a carefully orchestrated series of transformation steps in Spark Structured Streaming, optimized for accuracy, scale, and cost-efficiency:
- Efficient Deduplication: Events, identified by UUIDs, are checked against a Delta table of previously processed UUIDs to filter duplicates across streaming batches.
- Data Validation: Schema and business rules filter out malformed records, ensure required fields are present, and handle nulls.
- Custom Transformations with JSON-e: The JSON-e engine supports advanced constructs such as conditionals, loops, and Python UDFs, enabling product teams to define dynamic, reusable transformation logic tailored to each product.
- Flattening to Tabular Form: Transformed JSON events are flattened into thousands of structured tables. A separate internal schema management tool (managing 20,000+ tables and 5M+ columns) lets product teams manage schema changes and automatically promote them to production, where they are registered in Delta Lake and picked up seamlessly by the Spark streaming jobs.
- Flattened Record Deduplication: A hash of the stored columns is compared against the last 4 hours of processed data in Redis, preventing duplicate ingestion and reducing compute costs (a sketch of both deduplication layers follows this list).
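As a rough illustration of the two deduplication layers above, the sketch below checks event UUIDs against a Delta table and flattened-row hashes against Redis. The table name, Redis endpoint, and key expiry handling are assumptions for illustration, not the production implementation.

```python
# Sketch of the two deduplication layers (all names are hypothetical).
import hashlib
import redis  # assumes the redis-py client is available on the cluster

redis_client = redis.Redis(host="redis-host", port=6379)  # placeholder endpoint

def dedupe_by_uuid(spark, batch_df):
    """Layer 1: drop events whose UUID has already been processed (Delta-backed)."""
    processed = spark.read.table("ops.processed_event_uuids")  # hypothetical table
    return batch_df.join(processed, on="event_uuid", how="left_anti")

def is_duplicate_row(row_dict):
    """Layer 2: drop flattened rows already seen in the last 4 hours (Redis-backed)."""
    digest = hashlib.sha256(
        "|".join(f"{k}={row_dict[k]}" for k in sorted(row_dict)).encode()
    ).hexdigest()
    # SET with NX returns None when the key already exists, i.e. a duplicate.
    return redis_client.set(digest, 1, nx=True, ex=4 * 3600) is None
```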
The Storage Component: Lakehouse
Once transformed, the data is written directly to Delta Lake tables using several powerful optimizations:
- Parallel Writes with Multiprocessing: A single Spark job typically writes to ~250 Delta tables, each with its own transformation logic. This is executed using Python multiprocessing, which performs the Delta merges in parallel, maximizing cluster utilization and reducing latency (see the sketch after this list).
- Efficient Updates with Deletion Vectors: Up to 35% of records per batch are updates or deletes. Instead of rewriting large files, we leverage Deletion Vectors to enable soft deletes. This improves update performance by 3x, making real-time updates practical even at terabyte scale.
- Accelerated Merges with Disk Caching: Disk caching keeps frequently accessed (hot) data close to compute. By caching only the columns needed for merges, we achieve up to 4x faster merge operations while reducing I/O and compute costs. Currently, 95% of merge reads are served directly from the cache.
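Below is a sketch of how the many per-table merges can be parallelized from the driver. The post describes Python multiprocessing; this sketch uses a thread pool, which gives the same driver-side concurrency because each call mostly waits on Spark. Table names and the merge key are illustrative.

```python
# Sketch: upsert ~250 Delta tables concurrently from one job (names illustrative).
from concurrent.futures import ThreadPoolExecutor
from delta.tables import DeltaTable

def merge_into(spark, table_name, updates_df):
    # Upsert one table; each table has its own transformed output and key.
    (DeltaTable.forName(spark, table_name).alias("t")
        .merge(updates_df.alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

def merge_all(spark, updates_by_table, max_workers=16):
    # Run the per-table merges concurrently to keep the cluster busy.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(merge_into, spark, name, df)
                   for name, df in updates_by_table.items()]
        for f in futures:
            f.result()  # re-raise any merge failure

# Deletion vectors are enabled per table so updates/deletes become soft deletes:
# spark.sql("ALTER TABLE bronze.tickets "
#           "SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'true')")
```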
Autoscaling & Adapting in Real Time
Autoscaling is built into the pipeline so that the system scales up or down dynamically, handling volume at the lowest possible cost without impacting performance.
Autoscaling is driven by batch lag and execution time, monitored in real time. Resizing is triggered through the Jobs APIs from Spark’s StreamingQueryListener (the onQueryProgress callback after each batch), ensuring in-flight processing is never disrupted. This keeps the system responsive, resilient, and efficient without manual intervention.
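A simplified sketch of such a listener is shown below; resize_cluster is a hypothetical helper standing in for a Databricks REST API call, the thresholds are illustrative, and an existing SparkSession named spark is assumed.

```python
# Sketch: scale the cluster between micro-batches based on batch duration.
from pyspark.sql.streaming import StreamingQueryListener

def resize_cluster(delta_workers: int) -> None:
    # Placeholder: in practice this would call the Databricks Clusters/Jobs
    # REST API to add or remove workers.
    pass

class AutoscaleListener(StreamingQueryListener):
    def onQueryStarted(self, event):
        pass

    def onQueryProgress(self, event):
        # Runs after each micro-batch, so resizing never interrupts in-flight work.
        batch_seconds = event.progress.batchDuration / 1000.0
        if batch_seconds > 120:        # falling behind: scale out
            resize_cluster(delta_workers=+2)
        elif batch_seconds < 30:       # comfortably ahead: scale in
            resize_cluster(delta_workers=-1)

    def onQueryTerminated(self, event):
        pass

spark.streams.addListener(AutoscaleListener())  # assumes an existing SparkSession
```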
Built-In Resilience: Handling Failures Gracefully
To maintain data integrity and availability, the architecture includes robust fault tolerance:
- Events that fail transformation are retried via Kafka with backoff logic.
- Permanently failed records are stored in a Delta table for offline review and reprocessing, ensuring no data is lost (a minimal sketch follows this list).
- This design preserves data integrity without human intervention, even during peak loads or schema changes, and keeps the option open to republish failed data later.
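A minimal sketch of the dead-letter write, assuming a hypothetical ops.failed_events Delta table:

```python
# Sketch: quarantine permanently failed records for later review and republishing.
from pyspark.sql import functions as F

def quarantine_failures(failed_df, batch_id):
    (failed_df
        .withColumn("batch_id", F.lit(batch_id))
        .withColumn("failed_at", F.current_timestamp())
        .write.format("delta")
        .mode("append")
        .saveAsTable("ops.failed_events"))  # hypothetical dead-letter table
```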
Observability and Monitoring at Every Step
A robust monitoring stack, built with Prometheus, Grafana, and Elasticsearch and integrated with Databricks, gives us end-to-end visibility:
- Metrics Collection: Every batch in Databricks logs key metrics, such as input record count, transformed records, and error rates, which are pushed to Prometheus with real-time alerts to the support team (see the sketch after this list).
- Event Tracking: Event statuses are logged in Elasticsearch, enabling fine-grained debugging and allowing both product (producer) and analytics (consumer) teams to trace issues.
- Transformation & Batch Execution Metrics: The metrics above are used to track transformation health, identify issues, and trigger alerts for quick investigation.
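As a rough sketch of the metrics step, the snippet below pushes per-batch counters to a Prometheus Pushgateway using prometheus_client; the gateway address, job name, and metric names are assumptions.

```python
# Sketch: export per-batch metrics to a Prometheus Pushgateway (names illustrative).
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

def report_batch_metrics(input_rows, transformed_rows, error_rows):
    registry = CollectorRegistry()
    Gauge("batch_input_rows", "Rows read in the batch", registry=registry).set(input_rows)
    Gauge("batch_transformed_rows", "Rows written in the batch", registry=registry).set(transformed_rows)
    Gauge("batch_error_rows", "Rows that failed transformation", registry=registry).set(error_rows)
    push_to_gateway("pushgateway:9091", job="ingestion-pipeline", registry=registry)
```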
From Complexity to Confidence
Perhaps the most transformative shift has been in simplicity.
What once involved five systems and countless integration points is now a single, observable, autoscaling pipeline running entirely within Databricks. We’ve eliminated brittle dependencies, streamlined operations, and enabled teams to work faster and with greater autonomy. In essence, fewer moving parts meant fewer surprises and more confidence.
By reimagining the data stack around streaming and Delta Lake, we’ve built a system that not only meets today’s scale but is ready for tomorrow’s growth.
Why Databricks?
As we reimagined the data architecture, we evaluated several technologies, including Amazon EMR with Spark, Apache Flink, and Databricks. After rigorous benchmarking, Databricks emerged as the clear choice, offering a unique blend of performance, simplicity, and ecosystem alignment that met Freshworks’ evolving needs.
A Unified Ecosystem for Data Processing
Rather than stitching together multiple tools, Databricks offers an end-to-end platform that spans job orchestration, data governance, and CI/CD integration, reducing complexity and accelerating development.
- Unity Catalog acts as the single source of truth for data governance. With granular access control, lineage tracking, and centralized schema management, it ensures that:
  - our team can secure all data assets, with well-organized, per-tenant data access that preserves strict access boundaries, and
  - we remain compliant with regulatory needs, with all events and actions captured in audit tables, along with information on who has access to which assets.
- Databricks Jobs provide built-in orchestration and replaced our reliance on external orchestrators such as Airflow. Native scheduling and pipeline execution reduced operational friction and improved reliability.
- CI/CD and REST APIs let Freshworks’ teams automate everything, from job creation and cluster scaling to schema updates. This automation has accelerated releases, improved consistency, and minimized manual errors, allowing us to experiment fast and learn fast.
Optimized Spark Platform
- Key capabilities such as automated resource allocation, a unified batch and streaming architecture, executor fault recovery, and dynamic scaling to process millions of records allowed us to maintain consistent throughput, even during traffic spikes or infrastructure hiccups.
High-Performance Caching
- Databricks disk caching proved to be the key factor in meeting the required data latency, as most merges were served from hot data held in the disk cache.
- Its ability to automatically detect changes in the underlying data files and keep the cache up to date ensured that batch processing times consistently met the required SLA (see the configuration sketch below).
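For reference, a sketch of the relevant Databricks disk cache settings; the values are illustrative, and on many instance types the cache is already enabled by default.

```python
# Sketch: enable and size the Databricks disk (IO) cache (illustrative values).
spark.conf.set("spark.databricks.io.cache.enabled", "true")
spark.conf.set("spark.databricks.io.cache.maxDiskUsage", "100g")
spark.conf.set("spark.databricks.io.cache.maxMetaDataCache", "2g")
```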
Delta Lake: Foundation for Real-Time and Reliable Ingestion
Delta Lake plays a critical role in the pipeline, enabling low-latency, ACID-compliant, high-integrity data processing at scale.
| Delta Lake Feature | SaaS Pipeline Benefit |
|---|---|
| ACID Transactions | Freshworks writes high-frequency streams from multiple sources, with concurrent writes to the same data. Delta Lake’s ACID compliance ensures data consistency across reads and writes. |
| Schema Evolution | Because the products are fast-growing by nature, their schemas keep evolving. Delta Lake’s schema evolution adapts to these changing requirements: changes are applied seamlessly to Delta tables and picked up automatically by the Spark streaming applications. |
| Time Travel | With millions of transactions and audit requirements, the ability to go back to a snapshot of the data in Delta Lake supports auditing and point-in-time rollback. |
| Scalable Change Handling & Deletion Vectors | Delta Lake enables efficient insert/update/delete operations through transaction logs without rewriting large data files. This proved crucial in reducing ingestion latencies from hours to a few minutes in our pipelines. |
| Open Format | Since Freshworks is a multi-tenant SaaS system, the open Delta format provides broad compatibility with analytics tools on top of the Lakehouse, supporting multi-tenant read operations. |
By combining Spark’s speed, Delta Lake’s reliability, and Databricks’ integrated platform, we built a scalable, robust, cost-effective, and future-ready foundation for Freshworks’ real-time analytics.
What We Learned: Key Insights
No transformation is without its challenges. Along the way, we encountered a few surprises that taught us valuable lessons:
1. State Store Overhead: High Memory Footprint and Stability Issues
Using Spark’s dropDuplicatesWithinWatermark caused high memory use and instability, especially during autoscaling, and led to increased S3 list costs due to many small state files.
Fix: Switching to Delta-based caching for deduplication drastically improved memory efficiency and stability. The overall S3 list cost and memory footprint dropped sharply, reducing both the time and the cost of data deduplication.
2. Liquid Clustering: Common Challenges
Clustering on multiple columns resulted in sparse data distribution and increased data scans, reducing query performance.
Our queries had a primary predicate with several secondary predicates; clustering on multiple columns led to a sparse distribution of data on the primary predicate column.
Fix: Clustering on the single primary column led to better file organization and significantly faster queries by reducing data scans (a sketch follows).
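A sketch of the fix, with a hypothetical table and clustering column:

```python
# Sketch: liquid-cluster on the single primary predicate column, then OPTIMIZE
# so existing data is clustered incrementally (table and column are placeholders).
spark.sql("ALTER TABLE analytics.events CLUSTER BY (tenant_id)")
spark.sql("OPTIMIZE analytics.events")
```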
3. Garbage Collection (GC) Issues: Job Restarts Needed
Long-running jobs (7+ days) began to slow down and experience more frequent garbage collection cycles.
Fix: We introduced weekly job restarts to mitigate prolonged GC cycles and performance degradation.
4. Data Skew: Handling Kafka Topic Imbalance
Data skew appeared because different Kafka topics carried disproportionately different data volumes. This led to uneven data distribution across processing nodes, causing skewed task workloads and non-uniform resource utilization.
Fix: Repartitioning before transformations ensured an even, balanced data distribution, which balanced the processing load and improved throughput (see the sketch below).
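The fix itself is essentially a one-liner per stream; the partition count and key below are illustrative, and raw_events stands for the skewed input DataFrame.

```python
# Sketch: spread skewed Kafka input evenly across tasks before heavy transformations.
balanced = raw_events.repartition(400)                   # round-robin redistribution
# or, if a well-distributed key exists:
balanced_by_key = raw_events.repartition(400, "event_uuid")
```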
5. Conditional Merge: Optimizing Merge Performance
Even when only a few columns were needed, merge operations loaded all columns from the target table, leading to high merge times and I/O costs.
Fix: We implemented an anti-join before the merge and early discarding of late-arriving or irrelevant records, significantly speeding up merges by preventing unnecessary data from being loaded (a sketch follows).
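A sketch of the anti-join guard, assuming each row carries an id key and a precomputed row_hash of the stored columns (both names hypothetical):

```python
# Sketch: drop unchanged rows before MERGE so the merge touches fewer files.
from delta.tables import DeltaTable

def conditional_merge(spark, updates_df, table_name="analytics.tickets"):
    target_keys = spark.read.table(table_name).select("id", "row_hash")
    # Keep only rows that are new or whose content hash differs from the target.
    changed = updates_df.join(target_keys, on=["id", "row_hash"], how="left_anti")

    (DeltaTable.forName(spark, table_name).alias("t")
        .merge(changed.alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())
```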
Conclusion
By adopting Databricks and Delta Lake, Freshworks has redefined its data architecture, moving from fragmented, manual workflows to a modern, unified, real-time platform.
The impact?
- 4x improvement in data sync time during traffic surges
- ~25% cost savings thanks to scalable, cost-efficient operations with zero downtime
- 50% reduction in maintenance effort
- High availability and SLA-compliant performance, even during peak loads
- Improved customer experience through real-time insights
This transformation empowers every Freshworks customer, from IT to Support, to make faster, data-driven decisions without worrying about whether the data volumes behind their business needs can be processed and served.