Rockset is a cloud-based database designed specifically for real-time search and analytics on large-scale datasets, particularly those with rapidly changing or streaming data. When dealing with massive datasets, our customers typically demand the highest possible throughput and the lowest possible latency for their analytics. Against this backdrop, we’re often asked to benchmark Rockset against other databases, focusing on its ability to deliver high throughput and low data latency. We decided to investigate the streaming ingestion performance of Rockset’s next-generation cloud architecture by comparing it with Elasticsearch, an open-source search engine and a popular sink for Apache Kafka.
For this benchmark, we evaluated the ingestion performance of both Rockset and Elasticsearch in terms of throughput and data latency. Throughput measures how quickly a database can process incoming data, which determines how well it can keep up with high-velocity data streams. Data latency is the time between ingesting data and indexing it so that it is available for querying, which directly affects a database’s ability to deliver timely, accurate results. When evaluating performance, we focus on latency at the 95th and 99th percentiles, because databases used in production require reliable, consistent processing.
We found that Rockset outperformed Elasticsearch in both throughput and end-to-end p99 latency for streaming data ingestion.
In this blog, we’ll walk through the benchmark framework, configuration, and results. Let’s dig into these two databases to better understand the differences in their performance for search and analytics on high-velocity data streams.
Why measure streaming data ingestion?
Streaming data is surging forward as more and more companies leverage Apache Kafka to revolutionize their data architectures. The gaming, internet, and financial services industries have matured in their adoption of event streaming platforms, moving beyond data streams to data torrents. That makes it crucial to understand how well downstream databases like Rockset and Elasticsearch can ingest and index this data for real-time search and analytics.
Organizations combine a real-time event streaming platform like Confluent Cloud, Apache Kafka, or Amazon Kinesis with a downstream database to enable real-time use cases including personalization, anomaly detection, and logistics monitoring. Adding a database like Rockset or Elasticsearch downstream of the streaming platform offers several advantages:
- Serving queries that combine historical data and real-time streams
- Supporting real-time transformations and rollups at ingest time
- Handling data models that are in flux
- Indexing data to support complex query patterns
Moreover, rapid advances in search and analytics are continually shrinking the window of time businesses have to react and act on their data. Databases designed for real-time streaming data can process incoming events as they occur, avoiding slow batch processing and delivering timely insights.
Let’s dive into the benchmark to understand the streaming ingest performance achievable with Rockset and Elasticsearch.
RockBench for measuring throughput and data latency
We assessed the streaming ingestion performance of Rockset and Elasticsearch using RockBench, a benchmark that measures the peak throughput and end-to-end data latency a database can sustain while processing high-volume data streams.
RockBench has two components: a data generator and a metrics evaluator. The data generator writes events to the database every second; the metrics evaluator continuously measures both the overall throughput and the end-to-end data latency, that is, the time from when an event is generated until it is available for querying.
The data generator creates documents, each 1.25 KB in size, with each document representing a single event. This means that writing at 10 MB/s corresponds to roughly 8,000 documents per second (10 MB ÷ 1.25 KB ≈ 8,000).
Peak throughput is the maximum rate at which a database can ingest data without building up an ever-growing backlog. To measure it, we increased the write rate in increments of 10 MB/s and recorded the highest rate at which the database could sustain ingestion for 45 minutes without data latency continuously climbing.
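To make that procedure concrete, here is a minimal Python sketch of the ramp-up loop. The helpers passed in are hypothetical stand-ins, not part of RockBench itself: `run_generators_at(rate)` points the generators at the database at a given write rate, and `latency_keeps_growing(minutes)` reports whether data latency rose continuously over that window.

```python
# Minimal sketch of the peak-throughput search, under the assumptions above.

STEP_MB_S = 10          # increase the write rate in 10 MB/s increments
SUSTAIN_MINUTES = 45    # a rate counts only if the database holds it for 45 minutes

def find_peak_throughput(run_generators_at, latency_keeps_growing):
    rate_mb_s = STEP_MB_S
    peak_mb_s = 0
    while True:
        run_generators_at(rate_mb_s)
        if latency_keeps_growing(SUSTAIN_MINUTES):
            # The database can no longer keep up; the previous rate is the peak.
            return peak_mb_s
        peak_mb_s = rate_mb_s
        rate_mb_s += STEP_MB_S
```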
Each document contains 60 fields, including nested objects and arrays, to capture the complexity of real-world events. The documents also include several fields used to compute the end-to-end latency:
- `_id`: the unique document identifier
- `_event_time`: the clock time on the generator machine when the event was created
- `generator_identifier`: a 64-bit random number
The data latency of each document is computed by subtracting its `_event_time` from the current time on the machine running the metrics evaluator. This measurement also includes round-trip latency: the time it takes for the query to be sent to the database and the results to be returned. The metrics are published to a Prometheus server, and the p50, p95, and p99 latencies are computed across all evaluators.
For this performance evaluation, the data generator only inserts new documents into the database; it never updates existing documents.
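Below is a minimal sketch of a RockBench-style document generator. The real 60-field schema is not reproduced here; the `nested` and `payload` fields are illustrative placeholders, padded so that each serialized document lands near 1.25 KB.

```python
# Minimal sketch of a document generator producing ~1.25 KB insert-only events.
import json
import random
import time
import uuid

DOC_SIZE_BYTES = 1280                     # ~1.25 KB per document
GENERATOR_ID = random.getrandbits(64)     # 64-bit random generator identifier

def make_document():
    doc = {
        "_id": str(uuid.uuid4()),         # unique document identifier
        "_event_time": time.time(),       # generator clock time at creation
        "generator_identifier": GENERATOR_ID,
        "nested": {"tags": ["a", "b"], "values": [1, 2, 3]},  # nested objects/arrays
    }
    # Pad with filler so the serialized document is roughly 1.25 KB.
    padding = DOC_SIZE_BYTES - len(json.dumps(doc))
    doc["payload"] = "x" * max(padding, 0)
    return doc

def generate(batch_size=50):
    """Yield batches of new documents; ~8,000 docs/s corresponds to 10 MB/s."""
    while True:
        yield [make_document() for _ in range(batch_size)]
```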
RockBench Configuration & Results
To evaluate how ingestion and indexing performance scales, we ran Rockset and Elasticsearch on two configurations with different amounts of compute and memory. For Elasticsearch, we selected instance types that closely match the CPU and memory allocations of Rockset’s virtual instances. The configurations are referred to below by their total vCPU counts.
The generators and data latency evaluators for both Rockset and Elasticsearch ran in the same cloud and region, US West 2, to avoid cross-region network latency skewing the results. We used Elastic’s managed Elasticsearch offering on Azure, on instances backed by Intel Ice Lake processors. The data generator used Rockset’s write API to insert new documents into the database.
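As a rough illustration of that ingestion path, the sketch below POSTs a batch of documents to the documents endpoint of Rockset’s Write API. The regional API hostname, workspace, and collection names are placeholders; check the Rockset console for the exact values for your deployment.

```python
# Minimal sketch of writing a batch of documents with Rockset's Write API,
# assuming the documents endpoint
#   POST /v1/orgs/self/ws/{workspace}/collections/{collection}/docs
# and ApiKey authentication. Hostname and credentials are placeholders.
import os
import requests

ROCKSET_HOST = os.environ.get("ROCKSET_HOST", "https://api.usw2a1.rockset.com")
API_KEY = os.environ["ROCKSET_API_KEY"]

def write_batch(workspace, collection, documents):
    resp = requests.post(
        f"{ROCKSET_HOST}/v1/orgs/self/ws/{workspace}/collections/{collection}/docs",
        headers={"Authorization": f"ApiKey {API_KEY}"},
        json={"data": documents},   # one batch of documents per request
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()
```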
We ran the Elasticsearch benchmark on the Elastic-managed Elasticsearch service using its latest stable version, configured with 32 primary shards, one replica, and a single availability zone. After testing a range of refresh intervals, we settled on a refresh interval of 1 second, which is also Elasticsearch’s default. We evaluated both 64- and 32-shard configurations, with shard sizes ranging from 10 GB to 50 GB, and chose 32 shards based on ingestion performance. We also verified that shards were evenly distributed across all nodes and disabled rebalancing so the allocation stayed stable.
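For reference, the configuration described above roughly corresponds to the following index settings and bulk ingestion pattern using the official Elasticsearch Python client; the deployment URL, credentials, and index name are placeholders, and the actual benchmark harness may differ.

```python
# Minimal sketch of the index settings and bulk writes used in a benchmark
# like this one. Endpoint, API key, and index name are placeholders.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("https://my-deployment.es.example.com:9243", api_key="...")

# 32 primary shards, 1 replica, and a 1-second refresh interval (the default).
es.indices.create(
    index="rockbench",
    settings={
        "number_of_shards": 32,
        "number_of_replicas": 1,
        "refresh_interval": "1s",
    },
)

def bulk_write(documents):
    # Bulk requests are recommended over single-document index requests.
    actions = ({"_index": "rockbench", "_source": doc} for doc in documents)
    helpers.bulk(es, actions)
```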
Rockset is a managed service that handles shards, replicas, and indexing behind the scenes, so no comparable tuning was required. You can expect performance similar to these RockBench results when running everyday workloads on Rockset.
To evaluate performance under different workloads, we benchmarked write requests with batch sizes of 50 and 500 documents. The smaller batch size mirrors incrementally updating streams, while the larger one mirrors high-volume data flows.
Rockset achieves up to 4x higher throughput than Elasticsearch
Peak throughput is the highest rate at which a database can sustain ingestion without an ever-growing backlog of unprocessed data. At a batch size of 50, Rockset achieves up to 4x the throughput of Elasticsearch.
Rockset achieves up to 4x higher throughput than Elasticsearch at a batch size of 50.
Rockset achieves up to 1.6x higher throughput than Elasticsearch at a batch size of 500.
The benchmark results confirm that Elasticsearch performs better with larger batch sizes than with smaller ones; Elastic itself recommends bulk requests over single-document index requests for better performance. Because Elasticsearch is designed around large batches while Rockset handles incrementally updating streams efficiently, Rockset’s throughput advantage is largest at small batch sizes.
Peak throughput scales roughly linearly with compute resources on both Rockset and Elasticsearch. Rockset’s higher throughput on RockBench, especially under heavy write loads, makes it well suited to write-intensive workloads.
Rockset delivers up to 2.5x lower data latency than Elasticsearch
We examined the end-to-end latency of Rockset and Elasticsearch at their respective peak throughputs. Starting from a 1 TB dataset, we measured the average data latency over a 45-minute interval while each database ingested at its peak rate.
At a batch size of 50, Rockset reached a peak throughput of 90 MB/s, while Elasticsearch reached 50 MB/s. At a batch size of 500, Rockset reached up to 110 MB/s, while Elasticsearch topped out at 80 MB/s.
Table of the p50, p95, and p99 data latencies for batch sizes of 50 and 500 on Rockset and Elasticsearch. Latencies were measured on configurations with 128 vCPUs.
At the 95th and 99th percentiles, Rockset delivers lower data latency than Elasticsearch at peak throughput. The spread between p50 and p99 latency is also tighter in Rockset than in Elasticsearch.
At peak throughput, Rockset achieved up to 2.5x lower end-to-end latency than Elasticsearch for streaming data ingestion.
How does Rockset do it? Its cloud-native architecture drives these performance advantages.
The way a database is architected determines the real-time performance it can deliver.
The tightly coupled, shared-nothing architecture used by Elasticsearch has long been the de facto standard for real-time database systems, colocating compute and storage for performance. These results show that a disaggregated architecture can also deliver search and analytics on high-velocity streaming data.
Cloud-native architectures are built on the disaggregation of resources, with compute-storage separation being the most prominent example. You no longer need to overprovision resources for peak capacity; you can scale up or down on demand and provision exactly the amount of storage and compute your application needs.
Disaggregated architectures have, however, been criticized for trading performance for resource isolation. In a tightly coupled architecture, the same compute units handle both data ingestion and query processing, so the latest data is immediately available for querying, and storage and compute are colocated on the same nodes for fast ingestion and query performance.
With advances in cloud architecture, tight coupling is no longer necessary for real-time performance. By separating compute from storage, Rockset’s architecture still supports real-time search and analytics applications: it replicates the in-memory state of the latest writes across its clusters of compute and memory so that recent data is immediately queryable, which makes it well suited to latency-sensitive workloads. Rockset also provides a scalable, performance-optimized storage layer that is shared across applications.
With compute-compute separation, Rockset only has to process incoming data once, which gives it an ingestion efficiency advantage over Elasticsearch. In Elasticsearch, every replica spends compute on indexing and compacting newly written data. In Rockset, data is indexed and compacted on a single virtual instance and then made available to the other virtual instances serving applications. This compute efficiency on incoming writes is why Rockset achieved up to 4x higher throughput and 2.5x lower end-to-end latency than Elasticsearch on RockBench.
Summary: Rockset delivers up to 4x higher throughput and 2.5x lower latency than Elasticsearch
In this blog, we evaluated the streaming ingest performance of Rockset and Elasticsearch on high-velocity data streams and came away with the following conclusions.
Rockset delivers up to 4x higher throughput than Elasticsearch for incoming data streams. We arrived at this finding by measuring peak throughput, the rate beyond which data latency rises continuously, across different batch sizes and configurations.
Rockset consistently delivers lower data latency than Elasticsearch at both the 95th and 99th percentiles, making it an excellent choice for latency-sensitive applications that require real-time data. Rockset achieved up to a 2.5x reduction in end-to-end latency compared to Elasticsearch.
We compared the streaming ingest performance of Rockset and Elasticsearch on equivalent allocations of CPU and memory, and found that Rockset also offers the more compelling value. At the same price point, you not only get better performance; you also shed the complexity of managing clusters, shards, nodes, and indexes, freeing your team to focus on building features.
This performance benchmark ran on Rockset’s next-generation cloud architecture with compute-compute separation. By isolating streaming ingest compute from query compute and from storage, Rockset is uniquely positioned to outperform Elasticsearch on streaming workloads.
Join CTO Dhruba Borthakur and founding engineer and architect Igor Canadi for a discussion of how the high-level architectures and design choices of Rockset and Elasticsearch affect their performance.
Consider Rockset for your own real-time search and analytics workloads by starting a free trial.
With our connectors to Confluent Cloud, Kafka, and Kinesis, along with a range of OLTP databases, we’ve made it straightforward to get started.
Richard Lin, Senior Software Engineer; Julie Mills, Product Marketing Lead