Many organizations we have spoken with are exploring vector search for AI-powered personalization, recommendations, semantic search, and anomaly detection. Recent improvements in the accuracy and accessibility of large language models (LLMs), including BERT and OpenAI’s models, have led companies to rethink how they build relevant search and analytics experiences.
This blog post captures engineering stories from five early adopters of vector search (Pinterest, Spotify, eBay, Airbnb, and DoorDash) that have integrated AI into their applications.
We hope these stories will be useful to engineering teams thinking through the full lifecycle of vector search, from generating embeddings to production deployment.
What is vector search?
Vector search is a method for efficiently finding and retrieving similar items from a large dataset, based on representations of the data in a high-dimensional space. In this context, items can be anything, such as documents, images, or audio files, and are represented as vector embeddings. The similarity between items is computed using distance metrics, such as cosine similarity or Euclidean distance, which quantify how close two vector embeddings are.
The vector search process typically involves:
- Generating embeddings: Relevant features are extracted from raw data to create vector representations using models such as Word2Vec, GloVe, or BERT.
- Indexing: The vector embeddings are organized into a data structure that enables efficient search.
- Vector search: The items most similar to a given query vector are retrieved based on a chosen distance metric, such as cosine similarity or Euclidean distance.
To visualize vector search, imagine a three-dimensional space where each axis corresponds to a feature. The position of a point in the space is determined by the values of these features. In this space, similar items sit closer together and dissimilar items sit farther apart.
Given a query, we can then find the most similar items in the dataset. The query is represented as a vector embedding in the same space as the item embeddings, and the distance between the query embedding and each item embedding is computed. The item embeddings with the shortest distance to the query embedding are considered the most similar.
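As a rough illustration of the query step just described, the sketch below scores every item against a query with cosine similarity and keeps the closest ones. The item names and three-dimensional vectors are invented for the example, and the exhaustive scan stands in for the index a production system would use.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, items, k=2):
    """Return the ids of the k items most similar to the query (exhaustive scan)."""
    scored = [(cosine_similarity(query, vec), item_id) for item_id, vec in items.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [item_id for _, item_id in scored[:k]]

# Hypothetical 3-dimensional embeddings, invented for illustration only.
items = {
    "red_sofa": [0.9, 0.1, 0.0],
    "blue_sofa": [0.8, 0.2, 0.1],
    "desk_lamp": [0.0, 0.1, 0.9],
}
query = [1.0, 0.0, 0.0]
nearest = top_k(query, items, k=2)
print(nearest)
```

Scanning every item is only practical for small datasets, which is why the indexing step matters at scale.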
This is, of course, a simplified picture: real vector search operates in high-dimensional spaces.
The following sections summarize five engineering blogs on vector search and highlight key implementation considerations. The full engineering blogs can be found below.
Pinterest: Interest search and discovery
Pinterest uses vector search for image search and discovery across multiple areas of its platform, including recommended content on the home feed, related pins, and search, all powered by a multitask learning model.
A multitask model is trained to perform several tasks simultaneously, often sharing underlying representations or features, which can improve generalization and efficiency across related tasks. At Pinterest, the team trained and used the same model to drive recommended content on the home feed, related pins, and search.
Pinterest trains the model by pairing a user’s search query (q) with the content they engaged with, such as the pins they clicked on or saved (p). The (q, p) pairs are built differently for each task:
- Related pins: Word embeddings are derived from the selected subject (q) and the pin the user clicked on or saved (p).
- Search: Word embeddings are created from the search query text (q) and the pin the user clicked on or saved (p).
- Home feed: Word embeddings are generated from the user’s interest (q) and the pin the user clicked on or saved (p).
To obtain an overall entity embedding, Pinterest averages the associated word embeddings for related pins, search, and the home feed.
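The averaging step can be sketched as an element-wise mean over word vectors. The function name and the two-dimensional vectors below are hypothetical and are not taken from Pinterest’s implementation.

```python
def mean_embedding(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(component) / count for component in zip(*vectors)]

# Hypothetical word embeddings for one entity, invented for illustration.
word_vectors = [[1.0, 2.0], [3.0, 4.0]]
entity_vector = mean_embedding(word_vectors)
print(entity_vector)
```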
Pinterest built its own supervised multitask learning model, PinText-MTL, and evaluated its precision against unsupervised models such as GloVe and word2vec, as well as a single-task learning model, PinText-SR. PinText-MTL had higher precision than the other embedding models, meaning a higher proportion of true positive predictions among all positive predictions.
Pinterest also found that multitask learning models had higher recall, meaning they correctly identified a larger proportion of the relevant instances, which makes them a better fit for search and discovery.
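For readers less familiar with the two metrics, here is a minimal sketch of how precision and recall are computed for one set of retrieved results. The pin ids are made up, and this is the standard definition rather than Pinterest’s evaluation code.

```python
def precision_recall(predicted, relevant):
    """Precision and recall of a predicted result set against the relevant set."""
    predicted, relevant = set(predicted), set(relevant)
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical pin ids: four results returned, three pins actually relevant.
precision, recall = precision_recall(["a", "b", "c", "d"], ["a", "b", "e"])
print(precision, recall)
```

Here two of the four returned pins are relevant (precision), and two of the three relevant pins were returned (recall).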
To put this together in production, Pinterest trains the multitask model on streaming data from the home feed, search, and related pins. Once the model is trained, vector embeddings are generated in a large batch job using either Kubernetes with Docker or a MapReduce system. The platform builds a search index of the vector embeddings and runs a K-nearest neighbors (KNN) search to find the most relevant content for users. Results are cached to meet Pinterest’s performance requirements.
Spotify: Podcast search
Spotify combines keyword and semantic search to retrieve relevant podcast episodes for users. As an example, the team highlighted the limits of keyword search with the query “electric cars climate impact”, which returned zero results even though relevant podcast episodes exist in the Spotify library. To improve recall, the Spotify team used Approximate Nearest Neighbor (ANN) search for fast, relevant podcast search.
The team generates vector embeddings with the Universal Sentence Encoder CMLM model because it is multilingual, which supports a global library of podcasts, and produces high-quality embeddings. Other models were also evaluated, including BERT, a model trained on a large corpus of text data, but the team found that BERT was better suited to word embeddings than sentence embeddings and was pre-trained only on English.
Spotify uses the query text as the input for the query embedding and a concatenation of textual metadata fields, including title and description, for the podcast episode embedding. To determine similarity, Spotify calculated the cosine similarity between the query and episode embeddings.
To train the base Universal Sentence Encoder CMLM model, Spotify used positive pairs of successful podcast searches and episodes. The team incorporated in-batch negatives, a technique described in the research literature, to generate random negative pairings. Testing was also conducted with synthetic queries and manually written queries.
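The idea behind in-batch negatives can be sketched without any training framework: within a batch of positive (query, episode) pairs, each query is paired with the episodes that belong to the other queries. The function name and ids below are hypothetical, and the sketch shows only the pairing, not the loss or Spotify’s training code.

```python
def in_batch_negatives(positive_pairs):
    """Pair each query with every other episode in the batch as a negative."""
    negatives = []
    for i, (query, _) in enumerate(positive_pairs):
        for j, (_, episode) in enumerate(positive_pairs):
            if i != j:  # the pair at the same index is the positive, so skip it
                negatives.append((query, episode))
    return negatives

# Hypothetical batch of successful searches and the episodes they led to.
batch = [("q1", "e1"), ("q2", "e2"), ("q3", "e3")]
negatives = in_batch_negatives(batch)
print(len(negatives))
```

One caveat of this pairing: if two queries in a batch share the same relevant episode, a generated "negative" may in fact be a positive.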
To bring vector search into production for serving podcast recommendations, Spotify used the following steps and technologies:
- Index episode vectors: Spotify indexes the episode vectors offline in batches using Vespa, a search engine with native support for ANN. One reason Vespa was chosen is that it can also apply metadata filtering after the search on features such as episode popularity.
- Online inference: Spotify uses Vertex AI to generate the query vector. Vertex AI was chosen for its support for GPU inference, which is more cost-effective when large transformer models generate the embeddings, and for its query cache. Once the query embedding is generated, it is used to retrieve the top 30 podcast episodes from Vespa.
Semantic search helps identify relevant podcast episodes, but it cannot fully replace keyword search. Semantic search falls short on exact term matching, for example when users search for an exact episode or podcast name. Spotify therefore uses a hybrid approach that merges semantic search in Vespa with keyword search, followed by a final re-ranking stage that determines which episodes are shown to users.
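A hybrid flow of this shape can be sketched as: take the candidates from both retrievers, drop duplicates, and let a final scoring function decide the order. Everything named below (the function, the episode ids, the scores) is hypothetical, and the re-ranking model itself is reduced to a lookup table for the sake of the example.

```python
def hybrid_search(keyword_hits, semantic_hits, rerank_score, limit=3):
    """Merge candidates from two retrievers and order them with a final re-ranker."""
    # Union of candidates, keeping first-seen order and dropping duplicates.
    candidates = list(dict.fromkeys(keyword_hits + semantic_hits))
    return sorted(candidates, key=rerank_score, reverse=True)[:limit]

# Hypothetical candidates and re-ranker scores, invented for illustration.
keyword_hits = ["ep_exact_title"]
semantic_hits = ["ep_climate", "ep_exact_title", "ep_ev_news"]
scores = {"ep_exact_title": 0.95, "ep_climate": 0.80, "ep_ev_news": 0.60}

results = hybrid_search(keyword_hits, semantic_hits, rerank_score=scores.get)
print(results)
```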
eBay: Image search
Traditionally, search engines have displayed results by matching the search query text against textual descriptions of items or documents. This method relies heavily on language to infer preferences and is less effective at capturing elements of style or aesthetics. eBay introduced image search to help users find relevant, similar items that match the style they are looking for.
eBay uses a multimodal model, which is designed to process and integrate data from multiple modalities or input types, such as text, images, audio, or video, to make predictions or perform tasks. eBay incorporates both images and text into its model: a Convolutional Neural Network (CNN) produces image embeddings, while a text-based model produces title embeddings. Each listing is represented by a vector embedding that combines its image and title embeddings in a shared latent space.
Once the multimodal model has been trained on a large dataset of image-title listing pairs and recently published listings, it is ready for production in the site search experience. Because of the large number of listings on eBay, the data is loaded in batches into HDFS, eBay’s central data repository. eBay uses Apache Spark to retrieve and store the images and relevant fields required for further processing, including generating listing embeddings. The listing embeddings are published to a columnar store such as HBase, which is well suited to aggregating large-scale data. From HBase, the listing embeddings are indexed and served in Cassini, a search engine built at eBay.
The pipeline is managed with Apache Airflow, which scales even when the number and complexity of tasks is high. Airflow also supports Spark, Hadoop, and Python, which makes it convenient for the machine learning team to adopt.
Visual search lets users find similar styles and preferences in the furniture and home decor categories, where style and aesthetics are key to purchase decisions. In the future, eBay plans to expand visual search across all categories and to help users discover related items so they can create a consistent look and feel across their home.
Airbnb: Real-time personalized listings
Search and similar-listings features drive 99% of bookings on the Airbnb site. Airbnb built a listing embedding technique to improve similar listing recommendations and provide real-time personalization in search rankings.
Airbnb realized early on that embeddings could be applied beyond word representations to capture user behavior, including clicks and bookings.
To train the embedding models, Airbnb used more than 4.5 million active listings and 800 million search sessions, determining similarity from the listings a user clicks and skips within a session. Listings clicked by the same user in a session are pushed closer together, while listings the user skipped are pushed farther apart. The team settled on a 32-dimensional listing embedding, weighing offline performance against the memory needed for online serving.
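One way to picture the training signal is as labeled listing pairs drawn from a single session: clicked listings form pairs to pull together, and clicked-versus-skipped listings form pairs to push apart. The sketch below only builds such pairs. The function name, listing ids, and 1/0 labels are assumptions made for illustration, and the training objective itself is not shown.

```python
from itertools import combinations

def session_pairs(clicked, skipped):
    """Build labeled listing pairs from one session (1 = pull together, 0 = push apart)."""
    positives = [(a, b, 1) for a, b in combinations(clicked, 2)]
    negatives = [(c, s, 0) for c in clicked for s in skipped]
    return positives + negatives

# Hypothetical session: three listings clicked, one skipped.
pairs = session_pairs(clicked=["L1", "L2", "L3"], skipped=["L9"])
print(pairs)
```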
Airbnb found that certain listing characteristics do not need to be learned, because they can be obtained directly from metadata, such as price. Attributes such as architecture, style, and ambiance, however, are much harder to derive from metadata.
Before moving to production, Airbnb validated its model by testing how well it recommended listings that a user actually booked. The team also ran an A/B test comparing the existing listings algorithm with the vector embedding-based algorithm. The algorithm with vector embeddings produced a 21% uplift in click-through rate (CTR) and a 4.9% increase in users discovering a listing that they went on to book.
The team also realized that vector embeddings could be used as part of the model for real-time personalization in search. For each user, they used Kafka to collect and maintain, in real time, a short-term history of clicks and skips from the last two weeks. For every search the user runs, the system performs two similarity searches:
- one based on the geographic markets the user recently searched, and
- one based on the similarity between the candidate listings and the listings the user has clicked or skipped.
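The two steps can be sketched as a market filter followed by a similarity-based ordering. How Airbnb combines click and skip similarity is not described here, so the score below (closest clicked listing minus closest skipped listing) is purely an assumption, as are the function names, field names, and vectors.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def personalized_rank(candidates, market, clicked_vecs, skipped_vecs):
    """Keep candidates in the searched market, then order them by an assumed score."""
    in_market = [c for c in candidates if c["market"] == market]

    def score(candidate):
        # Assumed scoring rule: reward closeness to clicks, penalize closeness to skips.
        click_sim = max((cosine(candidate["vec"], v) for v in clicked_vecs), default=0.0)
        skip_sim = max((cosine(candidate["vec"], v) for v in skipped_vecs), default=0.0)
        return click_sim - skip_sim

    return [c["id"] for c in sorted(in_market, key=score, reverse=True)]

# Hypothetical listings and short-term history, invented for illustration.
candidates = [
    {"id": "A", "market": "paris", "vec": [1.0, 0.0]},
    {"id": "B", "market": "paris", "vec": [0.0, 1.0]},
    {"id": "C", "market": "rome", "vec": [1.0, 0.0]},
]
ranked = personalized_rank(candidates, "paris", [[0.9, 0.1]], [[0.1, 0.9]])
print(ranked)
```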
Embeddings were evaluated in offline and online experiments and became part of Airbnb’s real-time personalization features.
DoorDash: Personalized store feeds
DoorDash has a wide variety of stores that users can order from. Surfacing the most relevant stores based on personalized preferences improves search and discovery.
DoorDash wanted to bring latent information into its store feed algorithms using vector embeddings. This lets DoorDash uncover similarities between stores that were not well documented, such as whether a store offers sweet items, is considered trendy, or has vegetarian options.
DoorDash used Store2Vec, a derivative of Word2Vec (an embedding model used in natural language processing), which it trained on existing data. The team treated each store as a word and formed sentences from the list of stores viewed during a single user session, with a maximum of five stores per sentence. To create user vector embeddings, DoorDash summed the vectors of the stores from which the user ordered in the past six months, up to a maximum of 100 orders.
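The data preparation described above can be sketched in two small helpers. Both rest on assumptions: sessions longer than five stores are split into consecutive chunks (the text only states the cap), and the user vector is an element-wise sum over the most recent orders, with the six-month window left out for brevity. All names and vectors are hypothetical.

```python
def session_to_sentences(viewed_stores, max_len=5):
    """Split one session's viewed stores into 'sentences' of at most max_len stores."""
    return [viewed_stores[i:i + max_len] for i in range(0, len(viewed_stores), max_len)]

def user_embedding(order_history, store_vectors, max_orders=100):
    """Sum the store vectors for the user's most recent orders (oldest first in the list)."""
    recent = order_history[-max_orders:]
    dim = len(next(iter(store_vectors.values())))
    total = [0.0] * dim
    for store in recent:
        for i, value in enumerate(store_vectors[store]):
            total[i] += value
    return total

# Hypothetical session and store vectors, invented for illustration.
sentences = session_to_sentences(["a", "b", "c", "d", "e", "f", "g"])
store_vectors = {"burgers": [1.0, 0.0], "sushi": [0.0, 1.0]}
user_vector = user_embedding(["burgers", "sushi", "burgers"], store_vectors)
print(sentences, user_vector)
```

The resulting sentences are what a word2vec-style trainer would consume, with store ids in place of words.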
As an example, DoorDash used vector search to find similar restaurants for a user based on their recent orders from popular, trendy spots 4505 Burgers and New Nagano Sushi in San Francisco. DoorDash generated a list of similar restaurants by measuring the cosine distance from the user embedding to the store embeddings in the area. The stores closest in cosine distance included Kezar Pub and Wooden Charcoal Korean Village BBQ.
DoorDash incorporated the Store2Vec distance as one of the features in its larger recommendation and personalization model. With vector search, DoorDash saw a 5% increase in click-through rate. The team is also experimenting with new models, model optimizations, and real-time activity data from users.
Key considerations for vector search
Pinterest, Spotify, eBay, Airbnb, and DoorDash create better search and discovery experiences with vector search. Many of these teams started with text search and ran into limitations with fuzzy search or with searches for specific styles or aesthetics. In those scenarios, adding vector search made it easier to find relevant, and often personalized, podcasts, pillows, rentals, pins, and eateries.
These companies made a few decisions that are worth calling out when implementing vector search:
- Embedding models: Many teams started with an off-the-shelf model and then trained it on their own data. They also recognized that language models such as word2vec could be repurposed by swapping words and their descriptions for items and similar items that were recently clicked. Teams like Airbnb found that derivatives of language models, rather than image models, could still capture visual similarities and differences well.
- Training: Many of these companies trained their models on historical purchase and click-through data, making use of existing large-scale datasets.
- Indexing: While many companies adopted ANN search, Pinterest was able to combine metadata filtering with K-nearest neighbors (KNN) search for efficiency at scale.
- Hybrid search: Vector search rarely replaces text search. Often, as in Spotify’s example, a final ranking algorithm determines whether vector search or text search produced the most relevant result.
- Productionizing: Many teams use batch-based systems to create the vector embeddings, given that these embeddings are rarely updated. They use a different system, frequently Elasticsearch, to compute the query vector embedding live and incorporate real-time metadata into the search.
Rockset, a real-time search and analytics database, recently added support for vector search. Try vector search on Rockset for real-time personalization, recommendations, and anomaly detection, starting with $300 in credits.