Thursday, April 3, 2025

Rockset’s approach to vector search at cloud scale centers on a data structure it calls the “packed inverted index.” This architecture enables fast querying of large-scale datasets, using bit-packing and parallel processing to accelerate query execution and make massive volumes of data practical to search.

Over the past six months, Rockset’s engineering team has added similarity indexes to its search and analytics platform.

Indexing has long been a core focus for the Rockset team. Rockset’s Converged Index combines multiple data structures: a search index, a columnar store, a row store and, now, a similarity index that scales to billions of vectors and terabytes of data. All of these indexes are built for real-time updates, so streaming data becomes searchable within 200 milliseconds.

In early 2022, Rockset released a cloud architecture with compute-storage and compute-compute separation. As a result, ingesting new vectors and metadata does not degrade search performance: ingestion and indexing run on compute that is isolated from the compute serving queries, so customers can continuously stream and index vectors without affecting search. This isolation matters for streaming ingestion and for similarity indexing, both of which are compute-intensive.

What’s become increasingly clear is that vector search cannot stand on its own; applications combine it with filters on text, geospatial data, time-series data and more. With Rockset, filtering is as simple as writing a SQL WHERE clause, and searches run through a full-featured, built-in SQL engine.

In this blog, we’ll dig into how Rockset supports search and analytics in a single, unified database, and how it architecturally solved for native SQL, real-time updates and compute-compute separation.

FAISS-IVF at Rockset

While Rockset’s similarity indexing is designed to be algorithm-agnostic, our initial implementation uses FAISS-IVF, a widely adopted, well-documented algorithm that supports efficient updates.

Common approaches to building similarity indexes include graph, tree and inverted file (IVF) data structures. Graph and tree structures are slow to build, incurring computational cost and time lags that make them a poor fit for workloads with frequent updates to vector data. The inverted file approach is well suited to these workloads because of its fast indexing time and efficient search.

While the open-source FAISS library can scale as a standalone index, most customers want a database that handles the operational work of scaling vector search for them.

Much of Rockset’s value comes from solving database problems such as query optimization, multi-tenancy, sharding and consistency, all of which matter to customers who need vector search at scale.

Implementation of FAISS-IVF at Rockset

Because Rockset is designed for scale, it builds a distributed FAISS similarity index that is memory-efficient and supports fast insertion and recall.

A user runs a command to create a similarity index on any vector field in a Rockset collection. Under the hood, inverted file indexing partitions the vector space into regions, each represented by a centroid. Each vector is then assigned to the partition, or cell, whose centroid is closest to it.

CREATE SIMILARITY INDEX vg_ann_index
ON FIELD confluent_webinar.video_game_embeddings.embedding
DIMENSION 1536 AS 'faiss::IVF256,Flat';

FAISS assigns vectors to Voronoi cells, each defined by its centroid.
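The cell assignment described above can be sketched in a few lines of Python. This is a toy illustration with made-up centroids and vectors, not Rockset's or FAISS's actual implementation; in practice the centroids would be learned (e.g., via k-means) rather than hard-coded.

```python
import math

# Toy inverted-file (IVF) partitioning: each vector is assigned to the
# cell whose centroid is closest. Centroids are illustrative stand-ins
# for what a real system learns during index training.
centroids = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]

def nearest_centroid(vector, centroids):
    """Return the index of the closest centroid (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(vector, centroids[i]))

# Build the inverted file: cell id -> list of vector ids.
vectors = {0: (0.5, 1.0), 1: (9.0, 1.0), 2: (1.0, 9.5)}
cells = {i: [] for i in range(len(centroids))}
for vec_id, vec in vectors.items():
    cells[nearest_centroid(vec, centroids)].append(vec_id)
```

Each vector lands in exactly one cell, so a later search only needs to scan the cells nearest the query rather than every vector.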

When building a similarity index, Rockset creates a posting list of the centroids and their identifiers, which is stored in memory. Every document in the collection is also indexed: fields are added recording the nearest centroid and the residual, the document vector’s offset from that centroid. Centroid and residual data are stored on SSDs for performance and durability and backed by cloud object storage, which makes this design more cost-effective than in-memory vector databases. As new vectors arrive, their nearest centroids and residuals are computed and indexed.
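The index layout above can be sketched as follows. This is a hypothetical toy model, with illustrative names and hard-coded centroids, of how a nearest centroid and residual might be recorded alongside each document; it is not Rockset's actual storage format.

```python
# Toy sketch: a small in-memory posting list of centroids, plus per-document
# fields holding the nearest centroid id and the residual (vector - centroid).
centroids = {0: (0.0, 0.0), 1: (10.0, 0.0)}   # posting list kept in memory

def index_document(doc_id, vector, store):
    """Compute the nearest centroid and the residual; persist both with the doc."""
    cid = min(centroids,
              key=lambda c: sum((v - m) ** 2 for v, m in zip(vector, centroids[c])))
    residual = tuple(v - m for v, m in zip(vector, centroids[cid]))
    store[doc_id] = {"centroid": cid, "residual": residual}

store = {}                     # stands in for the SSD-backed document store
index_document("doc1", (9.0, 1.0), store)
# store["doc1"] now records centroid 1 and residual (-1.0, 1.0)
```

Because only the small centroid posting list lives in memory, while the per-document centroid and residual fields live with the rest of the document on disk, memory usage stays bounded regardless of how many vectors are indexed.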


Rockset can use the similarity index and its search index together. At query time, Rockset’s query optimizer first asks the FAISS library for the centroids most similar to the target embedding, then uses the search index to scan those centroids’ cells and retrieve the results.

Rockset gives users the ability to trade off recall and speed for their AI applications. At index creation time, users can configure the number of centroids: more centroids yield faster searches but increase indexing time. At query time, users can choose how many cells to probe, dynamically balancing speed and accuracy.
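The two-phase lookup and the probe trade-off can be sketched like this. The data, names, and `nprobe` parameter below are illustrative (following FAISS's terminology for the number of cells to probe), not Rockset's actual query path.

```python
import math

# Toy two-phase IVF search: rank centroids by distance to the query, then
# scan only the documents in the nprobe closest cells. A larger nprobe
# raises recall at the cost of scanning more cells.
centroids = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
cells = {0: {"a": (0.5, 1.0)}, 1: {"b": (9.0, 1.0)}, 2: {"c": (1.0, 9.5)}}

def ivf_search(query, nprobe=1, k=1):
    ranked = sorted(range(len(centroids)),
                    key=lambda i: math.dist(query, centroids[i]))
    candidates = []
    for cell in ranked[:nprobe]:                 # phase 1: pick nearest cells
        for doc_id, vec in cells[cell].items():  # phase 2: scan those cells
            candidates.append((math.dist(query, vec), doc_id))
    return [doc_id for _, doc_id in sorted(candidates)[:k]]

ivf_search((8.0, 2.0), nprobe=1)   # scans only the single nearest cell
```

With `nprobe=1` only one cell is scanned; raising `nprobe` widens the search and improves recall at the cost of more work per query.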

Rockset’s implementation is also storage-efficient: only the posting list of centroids is held in memory, while the similarity and search indexes themselves live on SSDs.

Construct apps with real-time updates

Handling inserts, updates and deletes is a well-known challenge for vector search. Because vector indexes are carefully organized for fast lookup, updating them in place with new vectors quickly degrades that structure.

Rockset supports efficient updates to both metadata and vectors. Rockset is built on RocksDB, an open-source embedded storage engine designed for mutability and built at Meta by the team behind Rockset.

Because RocksDB supports field-level mutations under the hood, an update to a document’s vector triggers a request to FAISS to compute a new centroid and residual. Rockset then overwrites only the stale centroid and residual values for that vector field. Newly written or updated vectors become queryable within roughly 200 milliseconds.
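The update path above can be sketched as a field-level mutation: only the changed vector's derived fields are recomputed, and the rest of the document is untouched. This is a hypothetical toy model with illustrative names, not Rockset's actual code.

```python
# Toy sketch of a field-level vector update: recompute only the nearest
# centroid and residual for the mutated vector field.
centroids = {0: (0.0, 0.0), 1: (10.0, 0.0)}

def update_vector(doc, new_vector):
    """Overwrite the vector field and its derived centroid/residual in place."""
    cid = min(centroids,
              key=lambda c: sum((v - m) ** 2 for v, m in zip(new_vector, centroids[c])))
    doc["vector"] = new_vector
    doc["centroid"] = cid
    doc["residual"] = tuple(v - m for v, m in zip(new_vector, centroids[cid]))

doc = {"vector": (1.0, 1.0), "centroid": 0, "residual": (1.0, 1.0)}
update_vector(doc, (9.0, 0.5))     # the vector moves into centroid 1's cell
```

Note that the update may move the document into a different cell, which is why the centroid assignment is recomputed rather than patched.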

Separation of indexing and search

Rockset isolates the indexing and querying of vectors so that search performance stays predictable even while data streams in continuously. In Rockset’s architecture, one virtual instance, a cluster of compute nodes, can ingest and index data while separate virtual instances serve queries. Multiple virtual instances can access the same data, so no replication is required.

This compute-compute separation lets Rockset support concurrent indexing and search. In many systems, concurrent reads and writes interfere with each other, pressuring teams to batch-load data during off-peak hours just to keep search performance steady.

Compute separation also matters because similarity indexes must be retrained periodically as data changes, and retraining is computationally expensive. In many systems, including data warehouses, reindexing and search run on the same cluster, so indexing can degrade the performance of the search application.

With compute-compute separation, Rockset removes the impact of indexing on search performance, delivering consistent query performance regardless of data size.

Hybrid search with native SQL

Many vector databases offer only limited support for hybrid search and metadata filtering: they restrict certain field types, make metadata hard to update, or cap how much metadata can be stored. Rockset was built for search and analytics, so metadata is a first-class citizen, with support for documents up to 40MB in size.

The main reason many new vector databases limit metadata is that fast filtering over arbitrary data is a genuinely hard problem. Given a query with several filters, you want to be able to measure the different filter types, assess their selectivity, and then reorder, plan and optimize the query. Solving this well requires a cost-based optimizer, an expensive investment, and dedicated databases like Rockset have spent years building one.
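The selectivity-based reordering described above can be sketched in miniature. This is a hypothetical toy, with made-up rows and predicates, of the core idea a cost-based optimizer applies: estimate how selective each filter is, then run the most selective first so it prunes the candidate set early.

```python
# Toy sketch of selectivity-based filter reordering. Row data and
# predicates are illustrative.
rows = [{"rating": r, "lang": "en" if r % 2 else "fr"} for r in range(10)]

predicates = [
    ("lang = 'en'", lambda row: row["lang"] == "en"),
    ("rating > 7",  lambda row: row["rating"] > 7),
]

def selectivity(pred, sample):
    """Fraction of a sample surviving the predicate (lower = more selective)."""
    return sum(pred(row) for row in sample) / len(sample)

# Reorder so the most selective predicate runs first.
ordered = sorted(predicates, key=lambda p: selectivity(p[1], rows))
filtered = [row for row in rows if all(pred(row) for _, pred in ordered)]
```

In a real optimizer the selectivity estimates come from statistics rather than a full scan, but the planning decision, cheapest and most selective filters first, is the same.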

Rockset lets you run an approximate nearest neighbor search, trading some accuracy for speed, directly within a query using approx_dot_product or approx_euclidean_dist.

WITH dune_embedding AS (
    SELECT embedding
    FROM commons.book_catalogue_embeddings catalogue
    WHERE title = 'Dune'
    LIMIT 1
)
SELECT title, writer, ranking, num_ratings, worth,
       APPROX_DOT_PRODUCT(dune_embedding.embedding, book_catalogue_embeddings.embedding) AS similarity,
       description, language, book_format, page_count, liked_percent
FROM commons.book_catalogue_embeddings
CROSS JOIN dune_embedding
WHERE ranking IS NOT NULL
  AND book_catalogue_embeddings.embedding IS NOT NULL
  AND writer != 'Frank Herbert'
  AND ranking > 4.0
ORDER BY similarity DESC
LIMIT 30

Rockset uses its search index to filter on metadata while simultaneously constraining the search to the nearest centroids’ cells. This single-stage filtering avoids the extra latency of two-stage approaches that pre-filter or post-filter around the vector search.
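Single-stage filtering can be sketched as evaluating the metadata predicate in-line while scanning the nearest cells, rather than as a separate pass before or after the vector search. The data and names below are illustrative, not Rockset's implementation.

```python
import math

# Toy single-stage filtering: the metadata predicate is checked while
# scanning the nprobe nearest cells, in the same pass as the distance
# computation.
centroids = [(0.0, 0.0), (10.0, 0.0)]
cells = {
    0: [("a", (0.5, 1.0), {"rating": 4.5}), ("b", (1.0, 0.0), {"rating": 2.0})],
    1: [("c", (9.0, 1.0), {"rating": 4.8})],
}

def filtered_search(query, predicate, nprobe=1, k=5):
    ranked = sorted(range(len(centroids)),
                    key=lambda i: math.dist(query, centroids[i]))
    hits = []
    for cell in ranked[:nprobe]:
        for doc_id, vec, meta in cells[cell]:
            if predicate(meta):                  # filter applied in-line
                hits.append((math.dist(query, vec), doc_id))
    return [doc_id for _, doc_id in sorted(hits)[:k]]

filtered_search((0.0, 0.0), lambda m: m["rating"] > 4.0, nprobe=1)
```

Because filtering happens inside the scan, no oversized candidate list is materialized first (as in post-filtering) and no full metadata pass runs before the vector search (as in pre-filtering).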

Scale in the cloud

At Rockset, we have years of experience building a search and analytics database for large-scale applications. Its cloud-native architecture was designed from the ground up for resource isolation, which is crucial for mission-critical applications and anything that must stay up around the clock. As customers have grown their data volumes, Rockset has scaled with them while maintaining a P50 data latency of 10 milliseconds.

As a result, companies now use vector search in large-scale production applications. The data leader at an airline uses Rockset as a vector search database to power LLM-based chatbots that inform operational decisions about flights, crews and passengers. One of the fastest-growing marketplaces in the US uses Rockset to power AI recommendations on its live auction platform.

If you’re building an AI application, we encourage you to get started with Rockset or learn more about how our technology applies to your use case.
