Breaking Bad… Data Silos
Have we really figured out how to avoid relational databases? While many organizations have successfully adopted Apache Kafka as their go-to technology for event-driven architectures, Kafka still falls short of seamlessly replacing traditional relational databases like PostgreSQL in modern software stacks. Regardless of where database technology goes next, we must address the persistent problem of data silos. That is why Rockset has partnered with Confluent, the company behind the cloud-native data streaming platform built on Apache Kafka. By integrating their fully managed services, we've created a seamless solution that breaks down relational database silos and delivers real-time analytics for modern data applications.
My first meaningful introduction to databases came in a university course taught by Professor Karen Davis, now a professor at Miami University in Oxford, Ohio. My senior project, funded by an NSF grant and focused primarily on Perl programming, unexpectedly steered me toward my current career. Since then, databases have been an integral part of my professional and personal life, and a ubiquitous, if often invisible, part of most people's daily routines.
In the interest of transparency, I should note that I am a former Confluent employee and now work at Rockset. At Confluent I often spoke about Stream-Table Duality: the idea that a table can spawn a stream, and a stream can be turned back into a table, so the two are just different representations of the same data. Tables are the familiar form, convenient for organizing and analyzing information, but even inside the database every change begins life as an event recorded in a log. Implementations vary, but most databases store a sequence of events in a transaction log or journal and materialize it into tables internally.
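To make the duality concrete, here is a minimal sketch using a hypothetical users table: the ordered sequence of changes is the stream, and the table at any moment is just the result of replaying that stream up to that point.

-- The "stream": an ordered log of changes to a hypothetical users table.
INSERT INTO users (id, email) VALUES (1, 'ada@example.com');   -- event 1
UPDATE users SET email = 'ada@new-mail.com' WHERE id = 1;      -- event 2
INSERT INTO users (id, email) VALUES (2, 'alan@example.com');  -- event 3
DELETE FROM users WHERE id = 1;                                -- event 4
-- The "table": replaying events 1-4 leaves one row (id = 2); replaying only
-- events 1-2 reconstructs the table as of that earlier point in time.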
If your organization relies on a single database, you're probably off the hook; data silos aren't your primary concern. For everyone else, data integration is crucial for exchanging information across disparate databases. The products and tools for this job differ in their mechanics but ultimately serve the same purpose. Change Data Capture (CDC) in particular has been around for some time, though specific implementations have taken many forms.
The most exciting of these is real-time CDC, enabled by the same internal database logging mechanisms that build tables. Traditional approaches like query-based CDC, file diffs, and full table overwrites can be effective, but they compromise on data freshness and place extra load on the source database.
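For contrast, here is a minimal sketch of query-based CDC, assuming a hypothetical events table with an updated_at column. The pipeline repeatedly polls for rows changed since its last checkpoint, which adds query load to the source and misses information that a log captures naturally.

-- Query-based CDC: poll for rows changed since the last checkpoint.
SELECT *
FROM events
WHERE updated_at > :last_checkpoint  -- checkpoint tracked by the pipeline
ORDER BY updated_at;
-- Deleted rows never match this predicate, and a row updated twice between
-- polls surfaces only its final state.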
GoldenGate, the CDC software acquired by Oracle in 2009, remains a trusted solution for real-time change data capture across a variety of source systems to this day. To count as real-time CDC, a system must be event-driven, reacting to changes as they occur rather than in batches that limit how quickly you can act on the data.
Real-Time CDC Is The Way
Now that you're likely intrigued by how Rockset and Confluent can help you bridge data silos with real-time CDC, let's dive into the details. It starts with your database of choice, specifically one with a transaction log that can be exploited to generate real-time CDC events. PostgreSQL, MySQL, SQL Server, and even Oracle are popular choices, but many other databases will work as well. We'll focus on PostgreSQL, though the concepts will be similar regardless of the database.
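Before any log-based CDC tool can read from PostgreSQL, logical decoding must be enabled and the connector needs a user with replication privileges. The exact steps depend on your hosting environment, but a typical sketch (with placeholder names and password) looks like this:

-- Enable logical decoding; requires a server restart.
ALTER SYSTEM SET wal_level = logical;
-- A user for the CDC connector; 'cdc_user' and its password are placeholders.
CREATE USER cdc_user WITH REPLICATION PASSWORD 'changeme';
GRANT SELECT ON ALL TABLES IN SCHEMA demo TO cdc_user;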
Next, we need a service that can generate CDC events in real time from PostgreSQL. There are a few options, but Confluent Cloud includes a built-in, fully managed instance of Debezium's open-source connector, tightly integrated with Apache Kafka. The connector takes an initial snapshot and then monitors row-level changes, writing the resulting events to Confluent Cloud topics. Capturing events this way is extremely convenient and produces a high-quality, well-supported stream of changes.
Confluent Cloud is a compelling place to store and process real-time CDC events in a scalable and reliable manner. Of its many benefits, perhaps the most important is the reduction in operational overhead. Without Confluent Cloud, you'd spend weeks setting up a Kafka cluster, months mastering and integrating the necessary security measures, and then assign a dedicated team to keep it running indefinitely. With Confluent Cloud, you can get a scalable, event-driven architecture running in minutes with just a credit card and a web browser. You can learn more on Confluent's website.
Last, but by no means least, Rockset will be configured to read the CDC events from Confluent Cloud and materialize them into a collection that closely resembles our source table. Rockset brings three core capabilities to bear on CDC scenarios.
- Rockset integrates with multiple sources as part of its fully managed service. Much like Confluent's managed PostgreSQL CDC connector, Rockset's integrations are managed for you, so a solid grasp of your source data model is essentially all you need to set them up.
- Rockset's schemaless ingestion model lets data change and evolve without breaking anything downstream. Rockset has been schemaless since 2019, as described in an earlier blog post. This flexibility is crucial in CDC systems, where new attributes can appear at any time and shouldn't require cumbersome updates or system downtime to accommodate.
- Rockset is mutable, so it can handle changes to existing records just as the source database does, typically via upserts or deletes. This sets it apart from other highly indexed systems that require laborious reprocessing and reindexing to handle updates.
Databases and data warehouses without these capabilities often resort to long ETL or ELT pipelines, increasing data latency and complexity. Rockset typically maps one to one between source and target objects, with little or no transformation required. I've always believed that if you can draw the design, you can build it, and the design for this architecture is simple and elegant. Below is the plan for this tutorial, which I'll divide into two main sections: setting up Confluent Cloud and setting up Rockset.
Streaming With Confluent Cloud
The first step in our tutorial is configuring Confluent Cloud to capture our change data from PostgreSQL. If you don't already have an account, signing up is free and easy. Confluent also provides a step-by-step UI for setting up the PostgreSQL CDC connector in Confluent Cloud. A few configuration details are worth calling out:
- Rockset can process events whether "after.state.only" is set to "true" or "false". We'll assume the default of "true" for the remainder of the tutorial.
- Make sure "output.data.format" is set to either "JSON" or "AVRO". Rockset does not currently support "PROTOBUF" or "JSON_SR". If you aren't already attached to Schema Registry and don't want to set it up, "JSON" is the more straightforward approach.
- Set "Tombstones on delete" to "false". This suppresses redundant tombstone records, so each delete produces a single event for Rockset to process.
- To get everything working properly, I also had to change the table's replica identity to "full", though this may already be configured in your database. You can confirm the current setting with the catalog query shown after this list.

ALTER TABLE cdc.demo.events REPLICA IDENTITY FULL;
- Consider dedicating a connector to any table with a high rate of change, since "tasks.max" limits each connector to a single task. Also use "table.include.list" to restrict each connector to a specific subset of non-system tables; otherwise it will monitor all of them by default.
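If you're unsure whether the replica identity is already set, you can check it in PostgreSQL's catalog; 'f' means FULL, while 'd' is the default of using the primary key.

-- Check the replica identity of the source table (table name is illustrative).
SELECT relname, relreplident
FROM pg_class
WHERE oid = 'demo.events'::regclass;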
Other configuration will depend on your environment and shouldn't affect how Rockset and Confluent Cloud interact. If you run into issues between PostgreSQL and Confluent Cloud, the usual suspects are misconfigured logging settings in PostgreSQL, permission problems, or network connectivity. Debugging those is beyond the scope of this post, but Confluent's documentation is thorough and its support team is happy to help. Once everything is in place, you should be able to see your change data streaming into Confluent Cloud.
Real Time With Rockset
With PostgreSQL change events now flowing through Confluent Cloud, it's time to configure Rockset to consume and process them. The good news is that setting up Rockset is just as easy as setting up the PostgreSQL CDC connector. Here's how to integrate Rockset with Confluent Cloud using the console. The same steps can be completed programmatically via Rockset's API; the console just provides a more visual experience.
1. Add a new integration.
2. Select the Confluent Cloud tile in the catalog.
3. Fill out the configuration fields, including Schema Registry if you're using Avro.
4. Create a new collection from this integration.
5. Fill out the data source configuration:
- Topic name
- Starting offset (consider "earliest" if the topic is relatively small or static)
- Data format (in our case, JSON)
6. Select the Debezium template among the CDC formats and choose the primary key. The default Debezium template assumes each event carries both a before and an after image of the row; since we set "after.state.only" to "true", our events carry only the after state, so the SQL transformation looks like this:
SELECT CASE WHEN _input.__deleted = 'true' THEN 'DELETE' ELSE 'UPSERT' END AS _op,
       CAST(CAST(_input.event_id AS INT) AS STRING) AS _id,
       TIMESTAMP_MICROS(CAST(_input.event_timestamp AS INT)) AS event_timestamp,
       _input.* EXCEPT (event_id, event_timestamp, __deleted)
FROM _input
Rockset provides templates for many common CDC formats, and the "_op" field supports a variety of operations to fit your needs. In our case we only care about deletes; everything else is treated as an upsert.
7. Fill out the workspace, name, and description. Take care with the retention policy: to materialize CDC data correctly, the collection needs to retain all of its documents.
Once the collection's status shows "Ready", you can start running queries. In just a few minutes you have built a collection that mirrors your PostgreSQL table, stays synchronized with 1-2 seconds of data latency, and is ready to serve millisecond-latency queries.
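From here you can explore the data in Rockset's query editor with ordinary SQL. A quick sanity check might look like the following, assuming a hypothetical collection named demo_events:

-- Sanity-check the new collection (collection name is illustrative).
SELECT COUNT(*) AS total_events,
       MAX(event_timestamp) AS most_recent_event
FROM demo_events;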
Speaking of queries, you can easily turn your query into a Query Lambda, Rockset's managed query service. Write the query in the query editor, save it, and execute it against your data via a REST endpoint managed by Rockset. Rockset tracks changes to the query over time through versions, along with performance metrics like frequency and latency, giving you insight into how it's used. It's a way to turn data-as-a-service into query-as-a-service without building your own SQL engine and API layer.
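Under the hood, a Query Lambda is just a saved, versioned, parameterized SQL query. For example, a lookup like the one below could be saved as a Query Lambda, with :event_id supplied by the caller at the REST endpoint (names are illustrative):

-- A parameterized query suitable for saving as a Query Lambda.
SELECT *
FROM demo_events
WHERE _id = :event_id;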
The Amazing Database Race
As an amateur herpetologist and biology enthusiast, I like to think of technology as evolving through selection, though in the case of databases the selection can sometimes feel artificial rather than natural. Early databases were rigid in format and structure yet predictably performant. At the height of the "big data" craze we traded away that rigor, spawning a branch of NoSQL databases known for their permissive data models and lax performance guarantees. Today, companies increasingly run on real-time decision-making, and they need databases that balance performance and flexibility to power their real-time ecosystems.
Rockset and Confluent are among the species that have crawled out of that sea and onto the shore of real-time analytics. Rockset's high-frequency ingestion, support for diverse data formats, and low-latency interactive queries are the traits of a new breed of database that will become increasingly common. Confluent, with Kafka at its core, has established itself as the standard for real-time data streaming and a driver of innovation in event-driven architectures. Together they deliver a real-time CDC analytics pipeline that requires zero code and zero infrastructure to manage, so you can focus on the applications and services that drive your business and quickly unlock the value in your data.
Get started today with free trials of both Rockset and Confluent Cloud. New Confluent Cloud users receive $400 of free credit to use within their first 30 days, no credit card required. Rockset has a similar offer: $300 of credit, also with no credit card required.