We’re thrilled to unveil vector search capabilities in Rockset, empowering developers to build fast, efficient search experiences that power personalization engines, fraud detection systems, and more. To showcase these features, we built a search demo over Amazon product descriptions, using embeddings to generate relevant search results with Rockset. In the live demo, Rockset delivers search results across thousands of documents in roughly 15 milliseconds.
Why use vector search?
Companies continue to accumulate vast amounts of unstructured data, from text documents to multimedia files to machine- and sensor-generated information. By some estimates, unstructured data accounts for roughly 80% of an organization’s total data, yet companies use only a small fraction of it to uncover insights, inform decisions, and build compelling experiences. Extracting insight from unstructured data remains a persistent challenge, demanding both technical expertise and domain knowledge, and as a result much of this data has gone untapped.
As machine learning, neural networks, and large language models have advanced, companies can now transform unstructured data into embeddings, typically represented as vectors, that are far easier to analyze. Vector search operates on these embeddings, identifying patterns and measuring the similarity between items in the unstructured data.
Before vector search, search methods relied mainly on keyword matching, which often required laborious manual tagging to connect relevant results. Manual tagging involves building categorization frameworks, understanding search patterns, reviewing existing documents, and maintaining custom rule sets. To return relevant gaming products, for example, we would manually tag “Fortnite” as both a “survival game” and a “multiplayer game.” We would also need to identify and tag related phrases such as “battle royale” and “open-world play” in order to deliver relevant search results.
More recently, keyword search has come to rely on tokenization. Tokenization parses titles, descriptions, and documents into individual words and word fragments, then returns results by matching these tokens against the search terms. Although tokenization reduces manual tagging and standardizes search results, keyword search still falls short at returning semantically equivalent matches, especially in natural language, where the contextual relationships between phrases are paramount.
By harnessing text-based embeddings, vector search captures semantic connections across phrases, sentences, and paragraphs, powering more robust and accurate search results. To find video games that feature “open-world exploration” and “multiplayer options,” along with themes of space and adventure, we can use vector search rather than manually tagging or tokenizing game descriptions, and deliver more accurate and relevant results.
Embeddings transform input data into dense vectors in a high-dimensional space, allowing for efficient similarity searches. Searching for similar vectors typically involves:
- Similarity measure: Compute the dot product or cosine similarity between two vectors to quantify how closely they align.
- Nearest neighbor search: Use approximate nearest neighbor (ANN) algorithms and libraries such as HNSW, Annoy, or FAISS to efficiently find the top-k most similar vectors in the high-dimensional space. These libraries use specialized data structures and indexing techniques to accelerate search.
- Scalability: For large datasets, apply clustering or hierarchical indexing to reduce the number of distance computations required for each query.
- Pruning: Use similarity heuristics to discard clearly dissimilar vectors early, drastically reducing computational overhead while maintaining accuracy.
The overall process balances speed and accuracy, ensuring efficient vector searches that can handle large-scale datasets.
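The steps above can be sketched in miniature with plain Python: an exact, brute-force top-k search by cosine similarity, which is what ANN libraries like FAISS approximate at scale. The document names and 3-element vectors are toy values for illustration, not real embeddings.

```python
import heapq
import math
from typing import List, Sequence, Tuple

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def knn(query: Sequence[float],
        corpus: List[Tuple[str, List[float]]], k: int = 2):
    """Exact (brute-force) top-k search; ANN indexes approximate this
    to avoid scoring every document."""
    scored = ((cosine_similarity(query, vec), doc_id) for doc_id, vec in corpus)
    return heapq.nlargest(k, scored)

# Toy 3-D vectors standing in for real, high-dimensional embeddings.
corpus = [
    ("fortnite", [0.9, 0.8, 0.1]),
    ("pubg", [0.85, 0.75, 0.15]),
    ("chess", [0.1, 0.2, 0.9]),
]
query = [0.88, 0.79, 0.12]  # e.g. an embedding for "battle royale"
print(knn(query, corpus, k=2))
```

At production scale, the brute-force loop is replaced by an ANN index, which trades a small amount of accuracy for dramatically faster lookups.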
Here are some key concepts that can be used to describe how embeddings capture the meaning of unstructured data:
By leveraging embeddings, we can uncover the relationships between phrases such as “Fortnite”, “PUBG” and “Battle Royale”. Embedding models create these mappings by projecting words into a high-dimensional space, allowing related terms to cluster together. In a simple two-dimensional version of this space, each word is assigned coordinates (x, y), and relationships can be analyzed by measuring the distances and angles between neighboring points.
In real-world applications, unstructured data can contain countless features, converting into high-dimensional embeddings with thousands of dimensions. Vector search analyzes these embeddings to identify phrases that sit close together, such as “Fortnite” and “PUBG”, as well as phrases that are closer still, like the synonyms “PlayerUnknown’s Battlegrounds” and its acronym “PUBG”.
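To make the geometry concrete, here is a toy two-dimensional version; the coordinates are invented for illustration, but distance and angle work the same way in thousands of dimensions.

```python
import math

# Invented 2-D coordinates for a few terms; real embeddings have hundreds
# or thousands of dimensions, but the geometry is identical.
points = {
    "fortnite": (2.0, 3.0),
    "pubg": (2.2, 2.8),
    "spreadsheet": (8.0, 0.5),
}

def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.dist(p, q)

def angle_deg(p, q):
    """Angle between two position vectors, in degrees."""
    cos = (p[0] * q[0] + p[1] * q[1]) / (math.hypot(*p) * math.hypot(*q))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))

print(euclidean(points["fortnite"], points["pubg"]))         # small: related terms
print(euclidean(points["fortnite"], points["spreadsheet"]))  # large: unrelated terms
```

Related terms end up a short distance apart at a small angle; unrelated terms land far away.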
Vector search has surged in popularity thanks to significant advances in the accuracy and accessibility of techniques for creating high-quality embeddings. The proliferation of these techniques has yielded high-dimensional representations that enable far deeper linguistic comprehension: OpenAI’s text embedding model, for example, produces vectors with 1,536 dimensions, a rich and detailed representation of the underlying text.
What drives innovation in data exploration? For us, it’s about empowering users to uncover insights from their data quickly and efficiently.
With embeddings generated for our unstructured data, we can turn to vector search to identify similarities between them. Rockset offers a range of out-of-the-box distance functions for calculating the similarity between embeddings and a search input. Using these similarity scores, we run a k-Nearest Neighbors (kNN) search on Rockset to retrieve the top k embeddings most similar to the search input.
Rockset supports vector search through its recently released vector operations and distance functions. With these additions, Rockset joins dedicated vector databases such as Milvus, Pinecone, and Weaviate in enabling efficient storage and indexing of vectors. Under the hood, Rockset leverages its Converged Index technology to deliver fast metadata filtering, vector search, and keyword search, supporting sub-second queries, complex aggregations, and large-scale joins.
Alongside vector search support, Rockset provides a range of benefits for building these experiences:
- Real-Time Data: Ingest and index rapidly changing data streams, with support for real-time updates.
- Streamlined Ingestion: Consolidate and transform data as it is ingested, reducing data storage needs.
- Efficient Search: Combine vector search with selective metadata filtering to deliver fast, efficient results.
- Hybrid Search Plus Analytics: Join your vector search results with the rest of your data via SQL to deliver rich, highly relevant experiences.
- Fully Managed Cloud Service: Run workloads on a scalable, highly available cloud-native database, with separate compute-storage and compute-compute architecture for independent scaling and lower costs.
Building Product Search Recommendations
Let’s run a semantic search using OpenAI and Rockset to discover relevant products on Amazon.com.
For this demonstration, we used publicly available Amazon product data, combining product listings with reviews to give a fuller picture of each product.
Generate Embeddings
The first step of this tutorial is to generate vector representations from the Amazon product descriptions. We used OpenAI’s embedding model for its performance, ease of access, and relatively small embedding size, though there are alternative models that can be run locally.
The model produces an embedding vector with 1,536 elements. We generate and save embeddings for the descriptions of 8,592 video game products listed on Amazon. We also create an embedding for our search query, which covers space and adventure themes as well as open-world play and multiplayer options.
We use the following code to generate the embeddings.
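The original code block was not preserved here, so below is a hedged sketch of how the batch generation might look with the OpenAI Python client; the model name, batch size, and function names are our choices for this demo, not requirements.

```python
# Sketch of batch embedding generation using the OpenAI Python client
# (openai>=1.0). Any embedding model that returns fixed-length vectors
# would work in place of the one named here.
from typing import Iterator, List

def batched(items: List[str], size: int) -> Iterator[List[str]]:
    """Split descriptions into batches to stay within API request limits."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def embed_descriptions(descriptions: List[str], client,
                       model: str = "text-embedding-ada-002") -> List[List[float]]:
    """Return one embedding per description, preserving input order."""
    vectors: List[List[float]] = []
    for batch in batched(descriptions, 100):
        response = client.embeddings.create(model=model, input=batch)
        vectors.extend(item.embedding for item in response.data)
    return vectors

if __name__ == "__main__":
    from openai import OpenAI  # reads OPENAI_API_KEY from the environment
    vecs = embed_descriptions(["A co-op open-world survival game."], OpenAI())
    print(len(vecs[0]))  # 1,536 dimensions for this model
```

Batching keeps request sizes manageable while preserving the one-to-one mapping between descriptions and embeddings.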
Add Embeddings to Rockset
We’ll load these embeddings, alongside the product details, into Rockset, creating a new collection on which to run vector search in the second step. Here’s an overview of the process:
Within Rockset, we create a new collection by uploading the file generated above, which combines the video game product listings with their embeddings. We could also have ingested data from sources such as Amazon S3 and Snowflake, or from streaming systems such as Kafka and Amazon Kinesis, using Rockset’s built-in connectors. Using ingest transformations, we transform the data with SQL as it is ingested. We use Rockset’s new VECTOR_ENFORCE function to validate the dimensions and element types of incoming arrays, ensuring compatibility between vectors during query execution.
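As a rough sketch of what this validation guards against, the snippet below shows an illustrative ingest transformation (the `description_embedding` field name and exact VECTOR_ENFORCE arguments are assumptions for this demo) alongside a local Python equivalent of the check.

```python
from numbers import Real

EXPECTED_DIM = 1536  # dimensionality of our OpenAI embeddings

# Illustrative ingest transformation SQL; the field name is a placeholder
# for this demo's collection, and the argument order is an assumption.
INGEST_TRANSFORMATION = """
SELECT *,
       VECTOR_ENFORCE(description_embedding, 1536, 'float') AS description_embedding
FROM _input
"""

def enforce_vector(value, dim=EXPECTED_DIM):
    """Local mimic of the ingest-time check: the value must be an array
    of exactly `dim` numeric elements."""
    if not isinstance(value, list) or len(value) != dim:
        raise ValueError(f"expected a {dim}-element array")
    if not all(isinstance(x, Real) and not isinstance(x, bool) for x in value):
        raise ValueError("expected numeric elements")
    return [float(x) for x in value]

print(len(enforce_vector([0.0] * 1536)))
```

Rejecting malformed arrays at ingest means every stored vector is guaranteed to be comparable with the query embedding at search time.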
Run Vector Search on Rockset
We’ll now run vector search on Rockset, leveraging the recently introduced distance functions. COSINE_SIM takes two arguments: the description embedding field and the search query embedding. With Rockset’s full-featured SQL, we have ample flexibility in how we put these functions to work.
For this exercise, we copied the search query embedding directly into the COSINE_SIM function inside the SELECT statement. Alternatively, we could have generated the embedding in real time by calling the OpenAI text embedding API and passing its output into a Rockset Query Lambda parameter.
Thanks to Rockset’s Converged Index, kNN search queries are highly efficient when combined with selective metadata filtering. Rockset applies these filters before computing similarity scores, so scores are evaluated only for relevant documents. To refine our vector search query, we filter on price and platform so that results fall within a given price range and run on a specific platform.
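A sketch of how such a query might be assembled, with the query embedding inlined into COSINE_SIM as described above; the `commons.games` collection, field names, and filter values are hypothetical stand-ins for this demo’s schema.

```python
# Build a kNN query with metadata filtering. As in the demo, the query
# embedding is pasted directly into COSINE_SIM; a Query Lambda parameter
# would avoid the copy-paste in production.
from typing import List

def build_knn_query(query_embedding: List[float], k: int = 5,
                    max_price: float = 50.0, platform: str = "PC") -> str:
    vec = "[" + ", ".join(f"{x:.6f}" for x in query_embedding) + "]"
    return f"""
SELECT title, price,
       COSINE_SIM(description_embedding, {vec}) AS similarity
FROM commons.games
WHERE price <= {max_price}       -- filters are applied before scoring
  AND platform = '{platform}'
ORDER BY similarity DESC
LIMIT {k}
"""

print(build_knn_query([0.1, 0.2, 0.3]))
```

Because the WHERE clause is evaluated first, COSINE_SIM only runs over documents that survive the price and platform filters.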
In just 15 milliseconds, Rockset computes this query across more than 8,500 documents and returns the top 5 results, running on a virtual instance with 16 vCPUs and 128 GB of memory.

The top three results for our search for space and adventure games with open-world play and multiplayer options are:
- Embark on an immersive role-playing adventure for 1-4 players, transporting yourself to a captivating realm of fantasy and discovery as you usher in a fresh chapter in this thrilling campaign.
- The spaceman crash-landed on a bizarre planet and is eager to locate the various components of his spacecraft. The issue? With mere hours left on the clock, he must act swiftly!
- A jolt of adrenaline straight to the soul: Are you ready for a 180-MPH wake-up call? Multiplayer modes accommodate up to four players in cooperative play, featuring a range of options including the fast-paced action of Deathmatch, the strategic challenge of Cop Mode, and the classic fun of Tag.
Rockset executes the semantic search in approximately 15 milliseconds, leveraging OpenAI-generated embeddings and combining vector search with metadata filtering to deliver faster, more relevant results. This example centered on product search, but efficient semantic search is valuable in many other situations where fast, relevant results are crucial:
Power personalized recommendations on your e-commerce site by applying vector search to user behavior, such as previous purchases and webpage interactions. By identifying patterns and similarities in customer preferences, vector search enables tailored product recommendations and personalized experiences.
Enhance fraud detection by using vector search to identify transactions that deviate significantly from a repository of trustworthy transaction patterns, generating embeddings from features such as transaction amount, geographic location, time of day, and other relevant variables.
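The fraud-detection idea can be sketched as a nearest-neighbor distance check: a transaction whose embedding sits far from every known-good transaction is flagged for review. The features, their scaling, and the threshold below are assumptions for illustration.

```python
import math

def nearest_distance(txn, trusted):
    """Distance from a transaction embedding to its closest trusted neighbor."""
    return min(math.dist(txn, t) for t in trusted)

def is_anomalous(txn, trusted, threshold=1.0):
    """Flag transactions far from every known-good pattern; the feature
    scaling and threshold here are assumptions for the sketch."""
    return nearest_distance(txn, trusted) > threshold

# Toy embeddings over (normalized amount, location code, hour-of-day / 24).
trusted = [(0.2, 0.1, 0.5), (0.25, 0.1, 0.55), (0.22, 0.12, 0.48)]
print(is_anomalous((0.21, 0.1, 0.5), trusted))   # close to past behavior
print(is_anomalous((0.95, 0.9, 0.1), trusted))   # far from all history
```

In production the trusted set would be a large indexed collection, and the brute-force minimum would be replaced by an indexed kNN lookup like the ones described above.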
Deploy vector search to compare metrics such as engine temperature, oil pressure, and brake wear across a fleet, assessing each vehicle’s relative health. By comparing readings against benchmarks from known-healthy vehicles, vector search can flag issues such as a failing engine or worn brakes, enabling timely maintenance.
As large language models become increasingly accessible and the cost of generating embeddings falls, we anticipate a substantial surge in the use of unstructured data over the next few years. Rockset brings real-time machine learning and analytics closer together, making it simple to add vector search through its fully managed cloud service and to query and analyze large-scale data in real time.
Search has become remarkably straightforward: developers no longer need to design intricate, rule-based algorithms or manually configure text-processing tools like tokenizers and analyzers. We see boundless opportunities for vector search; explore how Rockset can support your specific use case.
The Amazon Review dataset originates from:
Jianmo Ni, Jiacheng Li, and Julian McAuley. “Justifying Recommendations using Distantly-Labeled Reviews and Fine-Grained Aspects.”
Empirical Methods in Natural Language Processing (EMNLP), 2019.