Tuesday, February 4, 2025

Cost optimized vector database: Introduction to Amazon OpenSearch Service quantization techniques

The rise of generative AI applications has heightened the need to implement semantic search and natural language search. These advanced search features help find and retrieve conceptually similar documents from enterprise content repositories to serve as prompts for generative AI models. Raw data in various source repositories, in the form of text, images, audio, video, and so on, is converted with the help of embedding models into a standard numerical representation called vectors, which powers semantic and natural language search. As organizations harness more sophisticated large language and foundation models to power their generative AI applications, the supporting embedding models are also evolving to produce large, high-dimension vector embeddings. As vector volume expands, there is a proportional increase in memory usage and computational requirements, resulting in higher operational costs. To mitigate this issue, various compression techniques can be used to optimize memory usage and computational efficiency.

Quantization is a lossy data compression technique aimed at lowering computation and memory usage, leading to lower costs, especially for high-volume data workloads. There are many ways to compress data depending on its type and volume. The typical approach is to map infinite values (or a comparatively large set of finite ones) to a smaller, more discrete set of values. Vector compression can be achieved through two primary techniques: product quantization and scalar quantization. In product quantization, the original vector dimension array is broken into multiple sub-vectors, and each sub-vector is encoded into a fixed number of bits that represent the original vector. With this method you only store and search across the encoded sub-vectors instead of the original vector. In scalar quantization, each dimension of the input vector is mapped from a 32-bit floating-point representation to a smaller data type.

Amazon OpenSearch Service, as a vector database, supports scalar and product quantization techniques to optimize memory usage and reduce operational costs.

OpenSearch as a vector database

OpenSearch is a distributed search and analytics service. The OpenSearch k-nearest neighbor (k-NN) plugin allows you to index, store, and search vectors. Vectors are stored in OpenSearch as a 32-bit float array of type knn_vector, which supports up to 16,000 dimensions per vector.

OpenSearch uses approximate nearest neighbor search to provide scalable vector search. The approximate k-NN algorithm retrieves results based on an estimation of the nearest vectors to a given query vector. The two main methods for performing approximate k-NN are the graph-based Hierarchical Navigable Small World (HNSW) and the cluster-based Inverted File (IVF). These data structures are built and loaded into memory during the initial vector search operation. As vector volume grows, both the data structures and the associated memory requirements for search operations scale proportionally.

For example, each HNSW graph with 32-bit float data takes roughly 1.1 * (4 * d + 8 * m) * num_vectors bytes of memory. Here, num_vectors represents the total number of vectors to be indexed, d is the number of dimensions determined by the embedding model you use to generate the vectors, and m is the number of edges in the HNSW graph, an index parameter that can be adjusted to tune performance. Using this formula, memory requirements for vector storage for a configuration of 384 dimensions and an m value of 16 would be as follows (reproduced in the short script after the list):

  • 1 million vectors: 1.830 GB (1.1 * (4 * 384 + 8 * 16) * 1,000,000 bytes)
  • 1 billion vectors: 1,830 GB (1.1 * (4 * 384 + 8 * 16) * 1,000,000,000 bytes)
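
To sanity-check these estimates, you can evaluate the formula directly. The following short Python sketch reproduces the numbers above (it assumes GB means 10^9 bytes, as the figures in this post do):

# Sanity check of the HNSW memory formula: 1.1 * (4 * d + 8 * m) * num_vectors bytes
def hnsw_memory_gb(d: int, m: int, num_vectors: int) -> float:
    """Approximate native memory (GB) for a full-precision HNSW graph."""
    return 1.1 * (4 * d + 8 * m) * num_vectors / 1e9

print(hnsw_memory_gb(d=384, m=16, num_vectors=1_000_000))      # ~1.83 GB
print(hnsw_memory_gb(d=384, m=16, num_vectors=1_000_000_000))  # ~1,830 GB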

Although approximate nearest neighbor search can be optimized to handle massive datasets with billions of vectors efficiently, the memory required to load 32-bit full-precision vectors into memory during the search process can become prohibitively expensive. To mitigate this, OpenSearch Service supports the following four quantization techniques.

  • Binary quantization
  • Byte quantization
  • FP16 quantization
  • Product quantization

These techniques fall within the broader categories of scalar and product quantization discussed earlier. In this post, you will learn about quantization techniques for optimizing vector workloads on OpenSearch Service, focusing on memory reduction and cost efficiency. It also introduces the new disk-based vector search approach, which enables efficient querying of vectors stored on disk without loading them into memory. This approach integrates seamlessly with quantization techniques, featuring key configurations such as the on_disk mode and the compression_level parameter. These settings facilitate built-in, out-of-the-box scalar quantization at indexing time.

Binary quantization (up to 32x compression)

Binary quantization (BQ) is a type of scalar quantization. OpenSearch leverages the FAISS engine's binary quantization, enabling up to 32x compression during indexing. This technique reduces each vector dimension from the default 32-bit float to a 1-bit binary value by compressing the vectors into 0s and 1s. OpenSearch supports indexing, storing, and searching binary vectors. You can also choose to encode each vector dimension using 1, 2, or 4 bits, depending on the desired compression factor, as shown in the example below. The compression factor can be adjusted using the bits setting: a value of 2 yields 16x compression, while 4 results in 8x compression. The default setting is 1. In binary quantization, training is handled natively at indexing time, allowing you to avoid an additional preprocessing step.
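
To build intuition for what 1-bit encoding does to a vector, here is a purely illustrative NumPy sketch that thresholds each dimension at its mean and packs the resulting bits. This is a generic binarization scheme shown only for illustration; it is not necessarily the training procedure FAISS applies internally.

import numpy as np

def binarize(vectors: np.ndarray) -> np.ndarray:
    thresholds = vectors.mean(axis=0)               # one threshold per dimension (illustrative choice)
    bits = (vectors > thresholds).astype(np.uint8)  # 0/1 per dimension
    return np.packbits(bits, axis=1)                # pack 8 dimensions into each byte

vectors = np.random.rand(1000, 384).astype(np.float32)
codes = binarize(vectors)
print(codes.shape)  # (1000, 48): 384 float32 values (1,536 bytes) become 48 bytes, about 32x smaller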

To implement binary quantization, define the vector type as knn_vector and specify the encoder name as binary with the desired number of encoding bits. Note that the encoder parameter refers to the method used to compress vector data before storing it in the index. Optimize performance by using the space_type, m, and ef_construction parameters. See the OpenSearch documentation for details about the underlying configuration of approximate k-NN.

PUT my-vector-index
{
  "settings": {
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "my_vector_field": {
        "type": "knn_vector",
        "dimension": 8,
        "method": {
          "name": "hnsw",
          "engine": "faiss",
          "space_type": "l2",
          "parameters": {
            "m": 16,
            "ef_construction": 512,
            "encoder": {
              "name": "binary",
              "parameters": {
                "bits": 1
              }
            }
          }
        }
      }
    }
  }
}

Memory requirements for implementing binary quantization with FAISS-HNSW:

1.1 * (bits * (d / 8) + 8 * m) * num_vectors bytes.

Compression | Encoding bits | Memory required for 1 billion vectors with d=384 and m=16 (GB)
----------- | ------------- | ---------------------------------------------------------------
32x         | 1             | 193.6
16x         | 2             | 246.4
8x          | 4             | 352.0

For detailed implementation steps for binary quantization, see the OpenSearch documentation.

Byte quantization (4x compression)

Byte quantization compresses 32-bit floating-point dimensions to 8-bit integers in the range –128 to +127, reducing memory usage by 75%. OpenSearch supports indexing, storing, and searching byte vectors, which must be converted to 8-bit format prior to ingestion. To implement byte vectors, specify the k-NN vector field data_type as byte in the index mapping. This feature is compatible with both the Lucene and FAISS engines. An example of creating an index for byte-quantized vectors follows.

PUT /my-vector-index
{
  "settings": {
    "index": {
      "knn": true,
      "knn.algo_param.ef_search": 100
    }
  },
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "knn_vector",
        "dimension": 3,
        "data_type": "byte",
        "space_type": "l2",
        "method": {
          "name": "hnsw",
          "engine": "faiss",
          "parameters": {
            "ef_construction": 100,
            "m": 16
          }
        }
      }
    }
  }
}

This method requires ingesting an already byte-quantized vector into OpenSearch for direct storage in the byte-typed k-NN vector field. However, the recently launched disk-based vector search feature eliminates the need for external vector quantization; that feature is discussed in detail later in this post.
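
Until you adopt disk-based search, the byte conversion happens on the client side. The following is a minimal sketch of one common approach, min-max scaling to int8 with NumPy; the scaling scheme and the random sample data are illustrative assumptions, and other quantization schemes may preserve recall better for your embeddings.

import numpy as np

def to_int8(vectors: np.ndarray) -> np.ndarray:
    """Map float32 vectors into the int8 range [-128, 127] using min-max scaling."""
    v_min, v_max = vectors.min(), vectors.max()
    scaled = (vectors - v_min) / (v_max - v_min)          # scale into [0, 1]
    return np.round(scaled * 255 - 128).astype(np.int8)   # shift into [-128, 127]

float_vectors = np.random.rand(10, 3).astype(np.float32)  # toy 3-dimensional vectors
byte_vectors = to_int8(float_vectors)                      # ingest these into the byte-typed knn_vector field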

Memory requirements for implementing byte quantization with FAISS-HNSW:

1.1 * (1 * d + 8 * m) * num_vectors bytes.

For detailed implementation steps, see the OpenSearch documentation. For performance metrics regarding accuracy, throughput, and latency, see Byte-quantized vectors in OpenSearch.

FAISS FP16 quantization (2x compression)

FP16 quantization is a technique that uses a 16-bit floating-point scalar representation, reducing memory usage by 50%. Each vector dimension is converted from a 32-bit to a 16-bit floating-point value, effectively halving the memory requirements. The compressed vector dimensions must be in the range [–65504.0, 65504.0]. To implement FP16 quantization, create the index with the k-NN vector field and configure the following:

  • Set the k-NN vector field method and engine to HNSW and FAISS, respectively.
  • Define the encoder parameter and set name to sq and type to fp16.

When you ingest 32-bit floating-point vectors into OpenSearch, the scalar quantization FP16 (SQfp16) encoder automatically quantizes them to 16-bit floating-point vectors during ingestion and stores them in the vector field. The following example demonstrates creating an index for quantizing and storing FP16-quantized vectors.

PUT /my-vector-index
{
  "settings": {
    "index": {
      "knn": true,
      "knn.algo_param.ef_search": 100
    }
  },
  "mappings": {
    "properties": {
      "my_vector1": {
        "type": "knn_vector",
        "dimension": 3,
        "space_type": "l2",
        "method": {
          "name": "hnsw",
          "engine": "faiss",
          "parameters": {
            "encoder": {
              "name": "sq",
              "parameters": {
                "type": "fp16",
                "clip": true
              }
            },
            "ef_construction": 256,
            "m": 8
          }
        }
      }
    }
  }
}

Memory requirements for implementing FP16 quantization with FAISS-HNSW:

(1.1 * (2 * d + 8 * m) * num_vectors) bytes.

The preceding FP16 example includes an optional Boolean parameter called clip, which defaults to false. When false, vectors with out-of-range values (values not between –65504.0 and +65504.0) are rejected. Setting clip to true rounds out-of-range values into the supported range. For detailed implementation steps, see the OpenSearch documentation. For performance metrics regarding accuracy, throughput, and latency, see Optimizing OpenSearch with Faiss FP16 scalar quantization: Enhancing memory efficiency and cost-effectiveness.
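
Conceptually, SQfp16 performs a conversion like the following NumPy sketch during ingestion. OpenSearch does this server side; the snippet is only meant to illustrate the halved footprint per dimension and the effect of the clip setting.

import numpy as np

FP16_MAX = 65504.0  # largest magnitude representable as a 16-bit float

def to_fp16(vector, clip=True):
    v = np.asarray(vector, dtype=np.float32)
    if clip:
        v = np.clip(v, -FP16_MAX, FP16_MAX)           # fold out-of-range values into range
    elif np.any(np.abs(v) > FP16_MAX):
        raise ValueError("vector contains out-of-range values and clip is false")
    return v.astype(np.float16)                        # 2 bytes per dimension instead of 4

print(to_fp16([1.5, 70000.0, -3.25]))                  # 70000.0 is clipped to 65504.0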

Product quantization

Product quantization (PQ) is an advanced dimension-reduction technique that offers significantly higher levels of compression. While typical scalar quantization methods achieve up to 32x compression, PQ can provide compression levels of up to 64x, making it a more efficient solution for optimizing storage and cost. OpenSearch supports PQ with both the IVF and HNSW methods from the FAISS engine. Product quantization partitions vectors into m sub-vectors, each encoded with a bit count determined by the code size. The resulting vector's memory footprint is m * code_size bits.

FAISS product quantization involves three key steps:

  1. Create and populate a training index to build the PQ model, optimizing for accuracy.
  2. Run the _train API on the training index to generate the quantizer model.
  3. Create the vector index, configuring the k-NN field to use the prepared quantizer model.

The following example demonstrates the three steps for setting up product quantization.

Step 1: Create the training index. Populate the training index with a suitable dataset, making sure its dimensions align with the train-index specification. Note that the training index requires a minimum of 256 documents.

PUT /train-index
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 2
  },
  "mappings": {
    "properties": {
      "train-field": {
        "type": "knn_vector",
        "dimension": 4
      }
    }
  }
}

Step 2: Create a quantizer model called my-model by running the _train API on the training index you just created. Note that the encoder with name set to pq enables native vector quantization. Other encoder parameters include code_size and m. FAISS-HNSW requires a code_size of 8 and a training dataset of at least 256 (2^code_size) documents. For detailed parameter specifications, see the PQ parameter reference.

POST /_plugins/_knn/models/my-model/_train
{
  "training_index": "train-index",
  "training_field": "train-field",
  "dimension": 4,
  "description": "My test model description",
  "method": {
    "name": "hnsw",
    "engine": "faiss",
    "parameters": {
      "encoder": {
        "name": "pq",
        "parameters": {
          "code_size": 8,
          "m": 2
        }
      },
      "ef_construction": 256,
      "m": 8
    }
  }
}

Step 3: Map the quantizer model to your vector index.

PUT /my-vector-index
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 2,
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "target-field": {
        "type": "knn_vector",
        "model_id": "my-model"
      }
    }
  }
}

Ingest the complete dataset into the newly created index, my-vector-index. The encoder automatically processes the incoming vectors, applying encoding and quantization based on the compression parameters (code_size and m) specified in the quantizer model configuration.
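
For example, a bulk ingestion using the opensearch-py client might look like the following sketch. The endpoint, credentials, and toy vectors are placeholders; note that you ingest full-precision floats and the quantizer model compresses them at index time.

from opensearchpy import OpenSearch, helpers

# Placeholder endpoint and credentials; substitute your domain's values.
client = OpenSearch(
    hosts=[{"host": "my-domain.us-east-1.es.amazonaws.com", "port": 443}],
    http_auth=("user", "password"),
    use_ssl=True,
)

dataset = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]  # toy 4-dimensional vectors

# Stream full-precision vectors into my-vector-index; the PQ encoder attached
# through the quantizer model encodes them during indexing.
actions = (
    {"_index": "my-vector-index", "_source": {"target-field": vector}}
    for vector in dataset
)
helpers.bulk(client, actions)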

Memory requirements for implementing product quantization with FAISS-HNSW:

1.1 * ((code_size / 8) * m + 24 + 8 * m) * num_vectors bytes. Here, the code_size and the m in the first term are the encoder (PQ) parameters, the m in the 8 * m term is the HNSW graph parameter, and num_vectors is the total number of vectors.

During quantization, each training vector is broken down into multiple sub-vectors, or sub-spaces, defined by the configurable value m. The number of bits used to encode each sub-vector is controlled by the code_size parameter. Each sub-vector is then compressed, or quantized, separately by running k-means clustering with k set to 2^code_size. In this way, the vector is compressed to roughly m * code_size bits.
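
The following scikit-learn sketch mirrors those mechanics conceptually: split each vector into m sub-vectors, cluster each sub-space with k-means into 2^code_size centroids, and keep only the centroid IDs. FAISS performs this training internally when you call the _train API, so the code is illustrative rather than a reimplementation of the engine.

import numpy as np
from sklearn.cluster import KMeans

def pq_encode(vectors: np.ndarray, m: int = 2, code_size: int = 8) -> np.ndarray:
    n, d = vectors.shape
    assert d % m == 0, "dimension must be divisible by m"
    sub_dim = d // m
    codes = np.empty((n, m), dtype=np.uint8)
    for i in range(m):
        sub = vectors[:, i * sub_dim:(i + 1) * sub_dim]
        kmeans = KMeans(n_clusters=2 ** code_size, n_init=10, random_state=0).fit(sub)
        codes[:, i] = kmeans.predict(sub)              # store only the centroid id per sub-space
    return codes  # each vector shrinks from 32 * d bits to m * code_size bits

training_vectors = np.random.rand(512, 4).astype(np.float32)  # at least 256 training vectors
codes = pq_encode(training_vectors, m=2, code_size=8)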

For detailed implementation guidelines and an explanation of the configurable product quantization parameters, see the OpenSearch documentation. For performance metrics regarding accuracy, throughput, and latency using FAISS IVF with PQ, see Choose the k-NN algorithm for your billion-scale use case with OpenSearch.

Disk-based vector search

Disk-based vector search optimizes query efficiency by using compressed vectors in memory while keeping full-precision vectors on disk. This approach allows OpenSearch to perform searches across large vector datasets without loading entire vectors into memory, improving scalability and resource utilization. Implementation is achieved through two new configurations at index creation: mode and compression level. As of OpenSearch 2.17, the mode parameter can be set to either in_memory or on_disk during indexing. The previously discussed methods default to in_memory mode. In that configuration, the vector index is built using either a graph (HNSW) or bucket (IVF) structure, which is then loaded into native memory during search operations. While offering excellent recall, this approach can strain memory usage and scalability for high-volume vector workloads.

The on_disk mode optimizes vector search efficiency by storing full-precision vectors on disk while applying real-time, native quantization during indexing. Coupled with adjustable compression levels, this approach loads only compressed vectors into memory, improving memory and resource utilization as well as search performance. The following compression levels correspond to the scalar quantization methods discussed earlier.

  • 32x: Binary quantization (1-bit dimensions)
  • 4x: Byte and integer quantization (8-bit dimensions)
  • 2x: FP16 quantization (16-bit dimensions)

This mode also supports additional compression levels, such as 16x and 8x, that aren't available with in_memory mode. To enable disk-based vector search, create the index with mode set to on_disk as shown in the following example.

PUT /my-vector-index
{
  "settings" : {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "my_vector_field": {
        "type": "knn_vector",
        "dimension": 8,
        "space_type": "innerproduct",
        "data_type": "float",
        "mode": "on_disk"
      }
    }
  }
}

Configuring just the mode as on_disk employs the default configuration, which uses the FAISS engine and HNSW method with a 32x compression level (1-bit binary quantization). The ef_construction parameter, used to optimize indexing-time latency, defaults to 100. For more granular fine-tuning, you can override these k-NN parameters as shown in the example that follows.

PUT /my-vector-index
{
  "settings" : {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "my_vector_field": {
        "type": "knn_vector",
        "dimension": 8,
        "space_type": "innerproduct",
        "data_type": "float",
        "mode": "on_disk",
        "compression_level": "16x",
        "method": {
          "name": "hnsw",
          "engine": "faiss",
          "parameters": {
            "ef_construction": 512
          }
        }
      }
    }
  }
}

Because quantization is a lossy compression technique, higher compression levels typically result in lower recall. To improve recall, you can configure disk-based vector search to run in two phases using the search-time parameter ef_search and the oversample_factor, as shown in the following example.

GET my-vector-index/_search
{
  "query": {
    "knn": {
      "my_vector_field": {
        "vector": [1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5],
        "k": 5,
        "method_parameters": {
          "ef_search": 512
        },
        "rescore": {
          "oversample_factor": 10.0
        }
      }
    }
  }
}

In the first phase, oversample_factor * k results are retrieved from the quantized vectors in memory and their scores are approximated. In the second phase, the full-precision vectors of those oversample_factor * k results are loaded into memory from disk, and the scores are recomputed against the full-precision query vector. The results are then reduced to the top k.
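
The following self-contained Python sketch mimics that two-phase flow, with an int8 copy of the vectors standing in for the in-memory quantized index and a float32 array standing in for the full-precision vectors on disk. It is only meant to illustrate the oversample-then-rescore pattern that OpenSearch runs internally.

import numpy as np

rng = np.random.default_rng(0)
full_precision = rng.random((10_000, 8), dtype=np.float32)   # stays "on disk" in the real system
quantized = np.round(full_precision * 127).astype(np.int8)   # stays in memory

def disk_based_search(query, k=5, oversample_factor=10.0):
    n_candidates = int(k * oversample_factor)
    # Phase 1: approximate inner-product scores against the quantized vectors.
    approx_scores = quantized.astype(np.float32) @ query
    candidates = np.argpartition(-approx_scores, n_candidates)[:n_candidates]
    # Phase 2: rescore only the candidates with their full-precision vectors and keep the top k.
    exact_scores = full_precision[candidates] @ query
    return candidates[np.argsort(-exact_scores)[:k]]

print(disk_based_search(rng.random(8, dtype=np.float32)))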

The default oversample_factor for rescoring is determined by the configured dimension and compression level at indexing time. For dimensions below 1,000, the factor is fixed at 5. For dimensions exceeding 1,000, the default factor varies based on the compression level, as shown in the following table.

Compression level | Default oversample_factor for rescoring
----------------- | ---------------------------------------
32x (default)     | 3
16x               | 2
8x                | 2
4x                | No default rescoring
2x                | No default rescoring

As previously discussed, the oversample_factor can be adjusted dynamically at search time. This value presents a crucial trade-off between accuracy and search efficiency: a higher factor improves accuracy but proportionally increases memory usage and reduces search throughput. See the OpenSearch documentation to learn more about disk-based vector search and the proper use of oversample_factor.

Performance analysis of quantization methods: Reviewing memory, recall, and query latency

The OpenSearch documentation on approximate k-NN search provides a starting point for implementing vector similarity search. Additionally, Choose the k-NN algorithm for your billion-scale use case with OpenSearch offers valuable insights into designing efficient vector workloads that handle billions of vectors in production environments. It introduces product quantization techniques as a potential solution to reduce memory requirements and associated costs by shrinking the memory footprint.

The following table illustrates the memory requirements for storing and searching through 1 billion vectors using various quantization techniques. The table compares the default memory consumption of full-precision vectors using the HNSW method against the memory consumed by quantized vectors. The model used in this analysis is sentence-transformers/all-MiniLM-L12-v2, which produces 384 dimensions. The raw metadata is assumed to be no more than 100 GB.

                                      | Without quantization | Product quantization | Scalar quantization: FP16 vectors | Scalar quantization: Byte vectors | Scalar quantization: Binary vectors
m value                               | 16                   | 16                   | 16                                | 16                                | 16
pq_m, code_size                       | –                    | 16, 8                | –                                 | –                                 | –
Native memory consumption (GB)        | 1830.4               | 184.8                | 985.6                             | 563.2                             | 193.6
Total storage = 100 GB + vectors (GB) | 1930.4               | 284.8                | 1085.6                            | 663.2                             | 293.6
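
The native memory figures in this table follow directly from the formulas given in the earlier sections; a quick check (assuming GB means 10^9 bytes):

d, m, n = 384, 16, 1_000_000_000   # dimensions, HNSW m, vector count
gb = 1e9

full_precision = 1.1 * (4 * d + 8 * m) * n / gb                    # ~1830.4
product_quantization = 1.1 * ((8 / 8) * 16 + 24 + 8 * m) * n / gb  # code_size=8, pq_m=16: ~184.8
fp16_vectors = 1.1 * (2 * d + 8 * m) * n / gb                      # ~985.6
byte_vectors = 1.1 * (1 * d + 8 * m) * n / gb                      # ~563.2
binary_vectors = 1.1 * (1 * (d / 8) + 8 * m) * n / gb              # ~193.6

print(full_precision, product_quantization, fp16_vectors, byte_vectors, binary_vectors)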

Reviewing the preceding table shows that for a dataset comprising 1 billion vectors, the HNSW graph with 32-bit full-precision vectors requires roughly 1,830 GB of memory. Compression techniques such as product quantization can reduce this to 184.8 GB, while scalar quantization offers varying levels of compression. The following table summarizes the correlation between compression techniques and their impact on key performance indicators, including cost savings, recall rate, and query latency. This analysis builds upon the earlier review of memory usage to help you select the compression technique that meets your requirements.

The table presents two key search metrics: search latency at the 90th percentile (p90) and recall at 100.

  • Search latency @p90 indicates that 90% of search queries complete within that latency.
  • recall@100 – The fraction of the top 100 ground truth neighbors found in the 100 results returned.
                                  | Without quantization       | Product quantization                               | Scalar quantization: FP16 [mode=in_memory]              | Scalar quantization: Byte [mode=in_memory]        | Scalar quantization: Binary [mode=on_disk]
Preconditions/Datasets            | Applicable to all datasets | Recall depends on the nature of the training data  | Works for dimension values in the range [-65504, 65504] | Works for dimension values in the range [-128, 127] | Works well for larger dimensions (>= 768)
Preprocessing required?           | No                         | Yes, preprocessing/training is required             | No                                                       | No                                                 | No
Rescoring                         | No                         | No                                                   | No                                                       | No                                                 | Yes
Recall@100                        | >= 0.99                    | > 0.7                                                | >= 0.95                                                  | >= 0.95                                            | >= 0.90
p90 query latency (ms)            |                            |                                                      |                                                          |                                                    |
Cost (baseline $X)                | $X                         | $0.1*X (up to 90% savings)                           | $0.5*X (up to 50% savings)                               | $0.25*X (up to 75% savings)                        | $0.15*X (up to 85% savings)
Sample cost for a billion vectors | $20,923.14                 | $2,092.31                                            | $10,461.57                                               | $5,230.79                                          | $3,138.47

The sample cost estimate for a billion vectors is based on a configuration optimized for cost. Note that actual savings may vary based on your specific workload requirements and chosen configuration parameters. Notably, product quantization offers up to 90% cost reduction compared to the baseline HNSW graph-based vector search cost ($X). Scalar quantization similarly yields proportional cost savings, ranging from 50% to 85%, relative to the compressed memory footprint. The choice of compression technique involves balancing cost-effectiveness, accuracy, and performance, because it affects both precision and latency.

Conclusion

By leveraging OpenSearch's quantization techniques, organizations can make informed tradeoffs between cost efficiency, performance, and recall, empowering them to fine-tune their vector database operations for optimal results. These quantization techniques significantly reduce memory requirements, improve query efficiency, and offer built-in encoders for seamless compression. Whether you're dealing with large-scale text embeddings, image features, or any other high-dimensional data, OpenSearch's quantization techniques offer efficient solutions for vector search requirements, enabling the development of cost-effective, scalable, and high-performance systems.

As you move forward with your vector database initiatives, we encourage you to:

  1. Explore OpenSearch's compression techniques in depth
  2. Evaluate the applicability of each technique to your specific use case
  3. Determine the appropriate compression levels based on your requirements for recall and search latency
  4. Measure and compare cost savings based on accuracy, throughput, and latency

Stay informed about the latest developments in this rapidly evolving field, and don't hesitate to experiment with different quantization techniques to find the optimal balance between cost, performance, and accuracy for your applications.


About the Authors

Aruna Govindaraju is an Amazon OpenSearch Specialist Solutions Architect and has worked with many commercial and open-source search engines. She is passionate about search, relevancy, and user experience. Her expertise in correlating end-user signals with search engine behavior has helped many customers improve their search experience.

Vamshi Vijay Nakkirtha is a software engineering manager working on the OpenSearch Project and Amazon OpenSearch Service. His primary interests include distributed systems. He is an active contributor to various OpenSearch projects such as k-NN, Geospatial, and dashboard-maps.
