index.Index
vector_search.index.Index(self, uri, open_for_remote_query_execution=False, config=None, timestamp=None, group=None)
Abstract Vector Index class. Do not use this directly but rather use the open
factory method.
All Vector Index algorithm implementations are instantiations of this class. Apart from the abstract method interfaces, Index
provides implementations for common tasks such as supporting updates, time traveling, and metadata management.
Opens an Index, reading metadata and applying time-traveling options.
Parameters
Name | Type | Description | Default |
---|---|---|---|
uri | str | URI of the index. | required |
config | Optional[Mapping[str, Any]] | TileDB config dictionary. | None |
timestamp | | If int, open the index at a given timestamp. If tuple, open at the given start and end timestamps. | None |
open_for_remote_query_execution | bool | If True, do not load any index data in main memory locally, and instead load index data in the TileDB Cloud taskgraph created when a non-None driver_mode is passed to query(). If False, load index data in main memory locally. Note that you can still use a taskgraph for query execution; you'll just end up loading the data both on your local machine and in the cloud taskgraph. | False |
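As a rough illustration of the two timestamp forms, the sketch below filters a list of write timestamps the way a time-traveling open might. The helper name visible_writes and the inclusive bounds are assumptions made for this example, not taken from the library source.

```python
from typing import List, Optional, Tuple, Union

# Hypothetical alias for the two accepted timestamp forms.
Timestamp = Optional[Union[int, Tuple[int, int]]]


def visible_writes(write_times: List[int], timestamp: Timestamp = None) -> List[int]:
    """Return the write timestamps visible when opening at `timestamp`.

    Assumed semantics for this sketch:
    - None: every write is visible.
    - int t: writes at or before t are visible.
    - (start, end): writes with start <= time <= end are visible.
    """
    if timestamp is None:
        return list(write_times)
    if isinstance(timestamp, int):
        start, end = 0, timestamp
    else:
        start, end = timestamp
    return [t for t in write_times if start <= t <= end]


writes = [10, 20, 30, 40]
print(visible_writes(writes))            # [10, 20, 30, 40]
print(visible_writes(writes, 25))        # [10, 20]
print(visible_writes(writes, (15, 35)))  # [20, 30]
```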
Methods
Name | Description |
---|---|
clear_history | Clears the history maintained in a Vector Index based on its URI. |
consolidate_updates | Consolidates updates by merging updates from the updates table into the base index. |
delete | Deletes a vector by its external_id. |
delete_batch | Deletes vectors by their external_ids. |
delete_index | Deletes an index from storage based on its URI. |
get_dimensions | Abstract method implemented by all Vector Index implementations. |
query | Queries an index with a set of query vectors, retrieving the k most similar vectors for each query. |
query_internal | Abstract method implemented by all Vector Index implementations. |
update | Updates a vector by its external_id. |
update_batch | Updates a set of vectors by their external_ids. |
vacuum | The vacuuming process permanently deletes index files that are consolidated through the consolidation process. |
clear_history
vector_search.index.Index.clear_history(uri, timestamp, config=None)
Clears the history maintained in a Vector Index based on its URI.
This clears the update history before the provided timestamp.
Use this in conjunction with consolidate_updates to periodically clean up update history.
Parameters
Name | Type | Description | Default |
---|---|---|---|
uri | str | URI of the index. | required |
timestamp | int | Clears update history before this timestamp. | required |
consolidate_updates
vector_search.index.Index.consolidate_updates(retrain_index=False, **kwargs)
Consolidates updates by merging updates from the updates table into the base index.
The consolidation process is used to avoid query latency degradation as more updates are added to the index. It triggers a base index re-indexing, merging the non-consolidated updates and the rest of the base vectors.
TODO(sc-51202): This throws with an unintuitive error message if update()/delete()/etc. has not been called.
Parameters
Name | Type | Description | Default |
---|---|---|---|
retrain_index | bool | If true, retrain the index. If false, reuse data from the previous index. For IVF_FLAT, retraining means recomputing the centroids; when doing so, you can pass any ingest() arguments used to configure centroid computation, and they will be used when recomputing the centroids. If false, the centroids from the previous index are reused. | False |
**kwargs | | Extra kwargs passed here are passed to the ingest function. | {} |
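To make the merge step concrete, here is a minimal in-memory sketch of consolidation. The dictionaries standing in for the base index and the updates table, and the use of None to mark a delete, are assumptions for illustration; the real process re-indexes through ingest and is not modeled here, nor is retrain_index.

```python
def consolidate(base, updates):
    """Merge pending updates into a new base and return it with an empty updates table.

    base:    external_id -> vector (hypothetical stand-in for the base index)
    updates: external_id -> vector, or None for a delete (hypothetical)
    """
    merged = dict(base)
    for ext_id, vec in updates.items():
        if vec is None:
            merged.pop(ext_id, None)  # apply a delete
        else:
            merged[ext_id] = vec      # apply an insert or an overwrite
    return merged, {}


base = {1: [0.0, 0.0], 2: [1.0, 1.0]}
updates = {2: None, 3: [2.0, 2.0]}

new_base, pending = consolidate(base, updates)
print(new_base)  # {1: [0.0, 0.0], 3: [2.0, 2.0]}
print(pending)   # {}
```

After consolidation in this toy model, a query only has to look at the base, which is the latency benefit the description refers to.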
delete
vector_search.index.Index.delete(external_id, timestamp=None)
Deletes a vector by its external_id.
Parameters
Name | Type | Description | Default |
---|---|---|---|
external_id | np.uint64 | External ID of the vector to be deleted. | required |
timestamp | int | Timestamp to use for the deletes to take place at. | None |
delete_batch
vector_search.index.Index.delete_batch(external_ids, timestamp=None)
Deletes vectors by their external_ids.
Parameters
Name | Type | Description | Default |
---|---|---|---|
external_ids | np.array | External IDs of the vectors to be deleted. | required |
timestamp | int | Timestamp to use for the deletes to take place at. | None |
delete_index
vector_search.index.Index.delete_index(uri, config=None)
Deletes an index from storage based on its URI.
Parameters
Name | Type | Description | Default |
---|---|---|---|
uri | str | URI of the index. | required |
config | Optional[Mapping[str, Any]] | TileDB config dictionary. | None |
get_dimensions
vector_search.index.Index.get_dimensions()
Abstract method implemented by all Vector Index implementations.
Returns the dimension of the vectors in the index.
query
vector_search.index.Index.query(queries, k, driver_mode=None, driver_resource_class=None, driver_resources=None, driver_access_credentials_name=None, **kwargs)
Queries an index with a set of query vectors, retrieving the k most similar vectors for each query.
This provides an algorithm-agnostic implementation for updates:
- Queries the non-consolidated updates table.
- Calls the algorithm-specific implementation of query_internal to query the base data.
- Merges the results, applying the updated data.
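The three steps listed above can be sketched with a toy brute-force search. Everything here is hypothetical: the dictionaries standing in for the base data and the updates table, the None marker for deletes, and the over-fetch in step 2 are illustration choices, not the library's implementation.

```python
import math


def brute_force(vectors, query, k):
    """Return up to k (distance, external_id) pairs, nearest first."""
    scored = [(math.dist(vec, query), ext_id) for ext_id, vec in vectors.items()]
    return sorted(scored)[:k]


def query_with_updates(base, updates, query, k):
    """Toy version of: query the updates, query the base, merge.

    base:    external_id -> vector
    updates: external_id -> vector for an upsert, or None for a delete
    """
    # Step 1: query the non-consolidated updates, skipping deletes.
    live_updates = {i: v for i, v in updates.items() if v is not None}
    update_hits = brute_force(live_updates, query, k)

    # Step 2: query the base data. Over-fetch so that hits masked by
    # updates or deletes can be dropped without falling short of k.
    base_hits = brute_force(base, query, k + len(updates))

    # Step 3: merge, letting updated data override stale base entries.
    fresh_base_hits = [(d, i) for d, i in base_hits if i not in updates]
    return sorted(fresh_base_hits + update_hits)[:k]


base = {1: (0.0, 0.0), 2: (1.0, 0.0), 3: (5.0, 5.0)}
updates = {1: None, 3: (0.1, 0.0), 4: (0.2, 0.0)}  # delete 1, move 3, add 4

hits = query_with_updates(base, updates, (0.0, 0.0), k=2)
print([ext_id for _, ext_id in hits])  # [3, 4]
```

Vector 1 is the closest base entry but has been deleted, and vector 3 is ranked by its updated position rather than its stale base position.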
You can control where the query is executed by setting the driver_mode parameter:
- With driver_mode = None, the driver logic for the query will be executed locally.
- If driver_mode is not None, we will use a TileDB Cloud taskgraph to re-open the index and run the query.
With both options, certain implementations, i.e. IVF Flat, may let you create further TileDB taskgraphs as defined in the implementation-specific query_internal methods.
Parameters
Name | Type | Description | Default |
---|---|---|---|
queries | np.ndarray | 2D array of query vectors. This can be used as a batch query interface by passing multiple queries in one call. | required |
k | int | Number of results to return per query vector. | required |
driver_mode | Optional[Mode] | If not None, the query will be executed in a TileDB Cloud taskgraph using the driver mode specified. | None |
driver_resource_class | Optional[str] | If driver_mode is REALTIME, the resource class (standard or large) to use for the driver execution. | None |
driver_resources | Optional[Mapping[str, Any]] | If driver_mode is BATCH, the resources to use for the driver execution. Example: {"cpu": "1", "memory": "4Gi"} | None |
driver_access_credentials_name | Optional[str] | If driver_mode is not None, the access credentials name to use for the driver execution. | None |
**kwargs | | Extra kwargs passed here are passed to the query_internal implementation of the concrete index class. | {} |
query_internal
vector_search.index.Index.query_internal(queries, k, **kwargs)
Abstract method implemented by all Vector Index implementations.
Queries the base index with a set of query vectors, retrieving the k most similar vectors for each query.
Parameters
Name | Type | Description | Default |
---|---|---|---|
queries | np.ndarray | 2D array of query vectors. This can be used as a batch query interface by passing multiple queries in one call. | required |
k | int | Number of results to return per query vector. | required |
**kwargs | | Extra kwargs passed here for each algorithm implementation. | {} |
update
vector_search.index.Index.update(vector, external_id, timestamp=None)
Updates a vector by its external_id.
This can be used to add new vectors or update an existing vector with the same external_id.
Parameters
Name | Type | Description | Default |
---|---|---|---|
vector | np.array | Vector data to be updated. | required |
external_id | np.uint64 | External ID of the vector. | required |
timestamp | int | Timestamp to use for the update to take place at. | None |
update_batch
vector_search.index.Index.update_batch(vectors, external_ids, timestamp=None)
Updates a set of vectors by their external_ids.
This can be used to add new vectors or update existing vectors with the same external_id.
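The add-or-overwrite behavior described here can be pictured as a timestamped log in which the latest write for each external_id wins. The sketch below is a simplified model under that assumption; record_batch and resolve are hypothetical helpers, and the real updates table is a TileDB array rather than a Python list.

```python
def record_batch(log, vectors, external_ids, timestamp):
    """Append one log entry per (vector, external_id) pair at `timestamp`."""
    if len(vectors) != len(external_ids):
        raise ValueError("need exactly one external_id per vector")
    for vec, ext_id in zip(vectors, external_ids):
        log.append((timestamp, ext_id, tuple(vec)))


def resolve(log):
    """Collapse the log into external_id -> vector; the latest timestamp wins."""
    state = {}
    # sorted() is stable, so entries with equal timestamps keep insertion order.
    for _, ext_id, vec in sorted(log, key=lambda entry: entry[0]):
        state[ext_id] = vec
    return state


log = []
record_batch(log, [(0.0, 0.0), (1.0, 1.0)], [7, 8], timestamp=1)  # adds 7 and 8
record_batch(log, [(9.0, 9.0)], [8], timestamp=2)                 # overwrites 8

print(resolve(log))  # {7: (0.0, 0.0), 8: (9.0, 9.0)}
```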
Parameters
Name | Type | Description | Default |
---|---|---|---|
vectors | np.ndarray | 2D array containing the vectors to be updated. | required |
external_ids | np.array | External IDs of the vectors. | required |
timestamp | int | Timestamp to use for the updates to take place at. | None |
vacuum
vector_search.index.Index.vacuum()
The vacuuming process permanently deletes index files that are consolidated through the consolidation process. TileDB separates consolidation from vacuuming, in order to make consolidation process-safe in the presence of concurrent reads and writes.
Note:
- Vacuuming is not process-safe and you should take extra care when invoking it.
- Vacuuming may affect the granularity of the time traveling functionality.
The Index class vacuums consolidated fragments of the updates array.
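The split between consolidation and vacuuming, and its effect on time traveling, can be illustrated with a simplified fragment model. The dictionary layout and both helper functions below are assumptions for this sketch and do not reflect how TileDB stores fragments on disk.

```python
def consolidate_fragments(fragments):
    """Add one merged fragment and mark the inputs as consolidated; delete nothing."""
    live = [f for f in fragments if not f["consolidated"]]
    if not live:
        return list(fragments)
    merged = {
        "range": (min(f["range"][0] for f in live), max(f["range"][1] for f in live)),
        "consolidated": False,
    }
    retired = [dict(f, consolidated=True) for f in fragments]
    return retired + [merged]


def vacuum_fragments(fragments):
    """Permanently drop fragments that a consolidation has already merged."""
    return [f for f in fragments if not f["consolidated"]]


# Three separate writes at timestamps 1, 2, and 3.
frags = [{"range": (t, t), "consolidated": False} for t in (1, 2, 3)]

frags = consolidate_fragments(frags)
print(len(frags))  # 4: the old fragments are still readable

frags = vacuum_fragments(frags)
print(frags)       # [{'range': (1, 3), 'consolidated': False}]
```

In this model, the per-write fragments are gone after vacuuming, so the state as of timestamp 2 alone can no longer be reconstructed, which is the loss of time-travel granularity mentioned in the note.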