object_api.ObjectIndex
vector_search.object_api.ObjectIndex(self, uri, config=None, timestamp=None, open_for_remote_query_execution=False, open_vector_index_for_remote_query_execution=False, load_embedding=True, load_metadata_in_memory=True, environment_variables={}, **kwargs)
An ObjectIndex represents a TileDB Vector Search index that is associated with a user-defined object reader and embedding function. This allows users to easily create and query TileDB Vector Search indexes that are backed by arbitrary data.
For example, an ObjectIndex can be used to create a TileDB Vector Search index that is backed by a collection of images. The object reader would be responsible for loading the images from disk, and the embedding function would be responsible for generating embeddings for the images.
Once the ObjectIndex is created, it can be queried using the query() method. The query() method takes a list of query objects and returns a list of the nearest neighbors for each query object.
The ObjectIndex class also provides methods for updating the index (update_index()) and updating the object reader (update_object_reader()).
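To make the roles of the object reader and the embedding function concrete, the following toy sketch mimics the flow in plain Python. It does not use TileDB at all: the OBJECTS dictionary, embed(), and toy_query() are hypothetical stand-ins, and the brute-force search only illustrates what "nearest neighbors for each query object" means.

```python
import math

# Hypothetical stand-in for an object reader: object ID -> raw object.
OBJECTS = {0: "red apple", 1: "green apple", 2: "red car"}

VOCABULARY = ["red", "green", "apple", "car"]


def embed(text):
    """Toy embedding function: count how often each vocabulary word occurs."""
    words = text.split()
    return [float(words.count(word)) for word in VOCABULARY]


def toy_query(query_objects, k):
    """Return the IDs of the k nearest stored objects for each query object."""
    stored = {object_id: embed(obj) for object_id, obj in OBJECTS.items()}
    results = []
    for query_object in query_objects:
        query_vector = embed(query_object)
        # Brute-force ranking by Euclidean distance between embeddings.
        ranked = sorted(
            stored, key=lambda object_id: math.dist(query_vector, stored[object_id])
        )
        results.append(ranked[:k])
    return results


print(toy_query(["green apple"], k=2))  # [[1, 0]]
```

A real ObjectIndex replaces the dictionary with an object reader, the word counts with a user-defined embedding function, and the brute-force loop with a TileDB Vector Search index.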
Parameters
Name | Type | Description | Default |
---|---|---|---|
uri | str | The URI of the index. | required |
config | Optional[Mapping[str, Any]] | TileDB config dictionary. | None |
timestamp | | Timestamp to open the index at. | None |
load_embedding | bool | Whether to load the embedding function into memory. | True |
open_for_remote_query_execution | bool | If True, do not load the embedding model and any index data locally, and instead perform all query functionality in a TileDB Cloud taskgraph. | False |
open_vector_index_for_remote_query_execution | bool | If True, do not load any index data in main memory locally, and instead load index data and perform vector queries in a TileDB Cloud taskgraph. Compared to open_for_remote_query_execution, this loads the object embedding function and computes query object embeddings locally. | False |
load_metadata_in_memory | bool | Whether to load the metadata array into memory. | True |
environment_variables | Dict | Environment variables to set for the object reader and embedding function. | {} |
**kwargs | | Keyword arguments to pass to the index constructor. | {} |
Methods
Name | Description |
---|---|
query | Queries the index and returns the nearest neighbors for each query object. |
update_index | Updates the index with new data. |
update_object_reader | Updates the object reader for the index. |
query
vector_search.object_api.ObjectIndex.query(query_objects, k, query_metadata=None, metadata_array_cond=None, metadata_df_filter_fn=None, return_objects=True, return_metadata=True, driver_mode=Mode.REALTIME, driver_resource_class=None, driver_resources=None, extra_driver_modules=None, driver_access_credentials_name=None, merge_results_result_pos_as_score=True, merge_results_reverse_dist=None, merge_results_per_query_embedding_group_fn=max, merge_results_per_query_group_fn=operator.add, **kwargs)
Queries the index and returns the nearest neighbors for each query object.
The query objects can be any type of object that is supported by the object reader. For example, if the object reader is configured to read images, then the query objects should be images.
The k parameter specifies the number of nearest neighbors to return for each query object.
The query_metadata parameter can be used to pass metadata for the query objects. This metadata will be passed to the embedding function, which can use it to generate embeddings for the query objects.
The metadata_array_cond parameter can be used to filter the results of the query based on the metadata that is stored in the metadata array. This parameter should be a string that contains a valid TileDB query condition. For example, the following query condition could be used to filter the results to only include objects that have a color attribute that is equal to “red”:
metadata_array_cond="color='red'"
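As a hedged usage sketch, the helper below passes a query condition to query() using only parameters documented on this page. The helper name and the _RecordingIndex class are hypothetical: the stub merely records the call so that the snippet runs without TileDB, and with a real, opened ObjectIndex you would pass that index instead.

```python
def query_red_object_ids(index, query_objects, k=5):
    """Query an ObjectIndex-like object, keeping only results whose color is red."""
    return index.query(
        query_objects,
        k,
        metadata_array_cond="color='red'",  # condition string from the example in the text
        return_objects=False,  # ask for object IDs rather than the objects
        return_metadata=False,  # skip metadata, so the result is a 2-tuple
    )


class _RecordingIndex:
    """Hypothetical stand-in that records the arguments passed to query()."""

    def query(self, query_objects, k, **kwargs):
        self.last_call = {"query_objects": query_objects, "k": k, **kwargs}
        return [], []  # placeholder for (distances, object IDs)


stub = _RecordingIndex()
query_red_object_ids(stub, ["query image"], k=3)
print(stub.last_call["metadata_array_cond"])  # color='red'
```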
The metadata_df_filter_fn parameter can be used to filter the results of the query based on the metadata that is stored in the metadata array. This parameter should be a function that takes a pandas DataFrame as input and returns a pandas DataFrame as output. The input DataFrame will contain the metadata for all of the objects that match the query, and the output DataFrame should contain the metadata for the objects that should be returned to the user.
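A minimal sketch of such a filter function, assuming the metadata contains a numeric price attribute. Both column names are hypothetical and chosen only for illustration, and the small DataFrame built here merely stands in for the metadata that query() would pass to the function.

```python
import pandas as pd


def keep_affordable(metadata_df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the rows whose hypothetical 'price' attribute is below 100."""
    return metadata_df[metadata_df["price"] < 100]


# Stand-in for the metadata of the objects that matched a query.
matched = pd.DataFrame({"object_id": [0, 1, 2], "price": [50, 150, 99]})
filtered = keep_affordable(matched)
print(filtered["object_id"].tolist())  # [0, 2]
```

Such a function would be passed to query() as metadata_df_filter_fn=keep_affordable.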
The return_objects parameter specifies whether to return the objects themselves or just the object IDs. If this parameter is set to True, the query() method returns the objects instead of the object IDs.
The return_metadata parameter specifies whether to return the metadata for the objects. If this parameter is set to True, the query() method also returns the object metadata along with the distances and the objects or object IDs.
Parameters
Name | Type | Description | Default |
---|---|---|---|
query_objects | np.ndarray | The query objects. | required |
k | int | The number of nearest neighbors to return for each query object. | required |
query_metadata | Optional[OrderedDict] | Metadata for the query objects. | None |
metadata_array_cond | Optional[str] | A TileDB query condition that can be used to filter the results of the query based on the metadata that is stored in the metadata array. | None |
metadata_df_filter_fn | Optional[str] | A function that can be used to filter the results of the query based on the metadata that is stored in the metadata array. | None |
return_objects | bool | Whether to return the objects themselves, or just the object IDs. | True |
return_metadata | bool | Whether to return the metadata for the objects. | True |
driver_mode | Optional[Mode] | If not None, the query will be executed in a TileDB Cloud taskgraph using the driver mode specified. | Mode.REALTIME |
driver_resource_class | Optional[str] | If driver_mode is REALTIME, the resource class (standard or large) to use for the driver execution. | None |
driver_resources | Optional[Mapping[str, Any]] | If driver_mode is BATCH, the resources to use for the driver execution. Example: {"cpu": "1", "memory": "4Gi"} | None |
extra_driver_modules | Optional[List[str]] | A list of extra Python modules to install on the driver node. | None |
driver_access_credentials_name | Optional[str] | If driver_mode is not None, the access credentials name to use for the driver execution. | None |
merge_results_result_pos_as_score | bool | Applies only when there are multiple query embeddings per query. If True, each result score is based on the position of the result for the query embedding. | True |
merge_results_reverse_dist | Optional[bool] | Applies only when there are multiple query embeddings per query. If True, the distances are reversed based on their reciprocal (1 / dist). | None |
merge_results_per_query_embedding_group_fn | Callable | Applies only when there are multiple query embeddings per query. Group function used to group together object scores per query embedding (e.g., max or min). | max |
merge_results_per_query_group_fn | Callable | Applies only when there are multiple query embeddings per query. Group function used to group together object scores per query (e.g., add). This is applied after merge_results_per_query_embedding_group_fn. | operator.add |
**kwargs | | Keyword arguments to pass to the index query method. | {} |
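The merge_results_* parameters describe a two-level aggregation. The sketch below is one plausible reading of that description in plain Python, not the library's implementation: the function name, the input layout (one list of (object ID, distance) pairs per query embedding), and the use of reciprocal distances as scores are assumptions made for illustration.

```python
import operator


def merge_results(results_per_embedding, per_embedding_fn=max, per_query_fn=operator.add):
    """Merge per-embedding results into one ranked list of (object_id, score)."""
    totals = {}
    for results in results_per_embedding:
        # Level 1: within one query embedding, collapse repeated hits on an object.
        per_embedding = {}
        for object_id, dist in results:
            score = 1.0 / dist  # reciprocal: smaller distance -> larger score (dist > 0 assumed)
            if object_id in per_embedding:
                score = per_embedding_fn(per_embedding[object_id], score)
            per_embedding[object_id] = score
        # Level 2: across query embeddings, combine the per-embedding scores.
        for object_id, score in per_embedding.items():
            if object_id in totals:
                score = per_query_fn(totals[object_id], score)
            totals[object_id] = score
    # Highest combined score first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


results = [
    [(1, 0.5), (1, 0.25), (2, 1.0)],  # hits for the first query embedding
    [(2, 0.5), (3, 1.0)],  # hits for the second query embedding
]
print(merge_results(results))  # [(1, 4.0), (2, 3.0), (3, 1.0)]
```

With the defaults, an object matched several times by one query embedding keeps its best score (max), and scores from different query embeddings accumulate (operator.add).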
Returns
Type | Description |
---|---|
Union[Tuple[np.ndarray, OrderedDict, Dict], Tuple[np.ndarray, OrderedDict], Tuple[np.ndarray, np.ndarray, Dict], Tuple[np.ndarray, np.ndarray]] | A tuple containing the distances, objects or object IDs, and optionally the object metadata. |
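Because the length of the returned tuple depends on return_metadata, callers may find it convenient to normalize it. The helper below is a hypothetical convenience, not part of the library; plain lists stand in for the NumPy arrays so that the snippet runs without any dependencies.

```python
def unpack_query_result(result, return_metadata=True):
    """Normalize a query() result to (distances, objects_or_ids, metadata)."""
    if return_metadata:
        distances, objects_or_ids, metadata = result
    else:
        distances, objects_or_ids = result
        metadata = None
    return distances, objects_or_ids, metadata


# Stand-in for a result obtained with return_objects=False, return_metadata=False.
fake_result = ([[0.1, 0.4]], [[7, 3]])
distances, object_ids, metadata = unpack_query_result(fake_result, return_metadata=False)
print(object_ids, metadata)  # [[7, 3]] None
```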
update_index
vector_search.object_api.ObjectIndex.update_index(index_timestamp=None, workers=-1, worker_resources=None, worker_image=None, extra_worker_modules=None, driver_resources=None, driver_image=None, extra_driver_modules=None, worker_access_credentials_name=None, max_tasks_per_stage=-1, verbose=False, trace_id=None, embeddings_generation_mode=Mode.LOCAL, embeddings_generation_driver_mode=Mode.LOCAL, vector_indexing_mode=Mode.LOCAL, config=None, namespace=None, environment_variables={}, use_updates_array=True, **kwargs)
Updates the index with new data.
This is useful when the data that the index is built on has changed.
The update uses the ingest_embeddings_with_driver function to add embeddings to a TileDB Vector Search index.
This function orchestrates the embedding ingestion process by creating and executing a TileDB Cloud DAG (Directed Acyclic Graph). The DAG consists of two main stages:
- Embeddings Generation: This stage is responsible for computing embeddings for the objects to be indexed.
- Vector Indexing: This stage is responsible for ingesting the generated embeddings into the TileDB Vector Search index.
Both stages can be executed in one of three modes:
- LOCAL: Embeddings are ingested locally within the current process.
- REALTIME: Embeddings are ingested using a TileDB Cloud REALTIME TaskGraph.
- BATCH: Embeddings are ingested using a TileDB Cloud BATCH TaskGraph.
The ingest_embeddings_with_driver function provides flexibility in configuring the execution environment for both stages, and it can also run the full execution within a driver UDF. Users can specify the number of workers, resources, Docker images, and extra modules for both the driver and worker nodes.
The update_index() method takes the following parameters:
Parameters
Name | Type | Description | Default |
---|---|---|---|
index_timestamp | int | Timestamp to use for the update to take place at. | None |
workers | int | The number of workers to use for the update. If this parameter is not specified, then the default number of workers will be used. | -1 |
worker_resources | Dict | The resources to use for each worker. | None |
worker_image | str | The Docker image to use for each worker. | None |
extra_worker_modules | Optional[List[str]] | Extra modules to install on the worker nodes. | None |
driver_resources | Dict | The resources to use for the driver. | None |
driver_image | str | The Docker image to use for the driver. | None |
extra_driver_modules | Optional[List[str]] | Extra modules to install on the driver node. | None |
worker_access_credentials_name | str | The name of the TileDB Cloud access credentials to use for the workers. | None |
max_tasks_per_stage | int | The maximum number of tasks to run per stage. | -1 |
verbose | bool | Whether to print verbose output. | False |
trace_id | Optional[str] | The trace ID to use for the update. | None |
embeddings_generation_mode | Mode | The mode to use for generating embeddings. | Mode.LOCAL |
embeddings_generation_driver_mode | Mode | The mode to use for the driver of the embeddings generation task. | Mode.LOCAL |
vector_indexing_mode | Mode | The mode to use for indexing the vectors. | Mode.LOCAL |
config | Optional[Mapping[str, Any]] | TileDB config dictionary. | None |
namespace | Optional[str] | The TileDB Cloud namespace to use for the update. If this parameter is not specified, then the default namespace will be used. | None |
environment_variables | Dict | Environment variables to set for the object reader and embedding function. | {} |
**kwargs | | Keyword arguments to pass to the ingestion function. | {} |
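A hedged sketch of calling update_index() with explicit worker settings, using only parameter names listed for this method. The helper name and the _RecordingIndex class are hypothetical; the stub records the keyword arguments so that the snippet runs without TileDB Cloud. The resource dictionary reuses the {"cpu": "1", "memory": "4Gi"} shape given for driver_resources in query(), which is assumed, not confirmed, to apply to worker_resources as well.

```python
def refresh_index(index, workers=2):
    """Update an ObjectIndex-like object after its underlying data has changed."""
    index.update_index(
        workers=workers,
        worker_resources={"cpu": "1", "memory": "4Gi"},  # assumed resource format
        max_tasks_per_stage=2,
        verbose=True,
    )


class _RecordingIndex:
    """Hypothetical stand-in that records the arguments passed to update_index()."""

    def update_index(self, **kwargs):
        self.last_kwargs = kwargs


stub = _RecordingIndex()
refresh_index(stub)
print(stub.last_kwargs["workers"])  # 2
```

The mode parameters are left at their defaults here, so with a real index both stages would run with Mode.LOCAL.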
update_object_reader
vector_search.object_api.ObjectIndex.update_object_reader(object_reader, config=None)
Updates the object reader for the index.
This is useful if the object reader needs to read objects from a different location or in a different format.
Parameters
Name | Type | Description | Default |
---|---|---|---|
object_reader | ObjectReader | The new object reader. | required |
config | Optional[Mapping[str, Any]] | TileDB config dictionary. | None |