object_api.ObjectIndex

vector_search.object_api.ObjectIndex(self, uri, config=None, timestamp=None, open_for_remote_query_execution=False, open_vector_index_for_remote_query_execution=False, load_embedding=True, load_metadata_in_memory=True, environment_variables={}, **kwargs)

An ObjectIndex represents a TileDB Vector Search index that is associated with a user-defined object reader and embedding function. This allows users to easily create and query TileDB Vector Search indexes that are backed by arbitrary data.

For example, an ObjectIndex can be used to create a TileDB Vector Search index that is backed by a collection of images. The object reader would be responsible for loading the images from disk, and the embedding function would be responsible for generating embeddings for the images.

Once the ObjectIndex is created, it can be queried using the query() method. The query() method takes a list of query objects and returns a list of the nearest neighbors for each query object.

The ObjectIndex class also provides methods for updating the index (update_index()) and updating the object reader (update_object_reader()).

Parameters

Name Type Description Default
uri str The URI of the index. required
config Optional[Mapping[str, Any]] TileDB config dictionary. None
timestamp Timestamp to open the index at. None
load_embedding bool Whether to load the embedding function into memory. True
open_for_remote_query_execution bool If True, do not load the embedding model or any index data locally; instead, perform all query functionality in a TileDB Cloud taskgraph. False
open_vector_index_for_remote_query_execution bool If True, do not load any index data in main memory locally, and instead load index data and perform vector queries in a TileDB Cloud taskgraph. Compared to open_for_remote_query_execution, this loads the object embedding function and computes query object embeddings locally. False
load_metadata_in_memory bool Whether to load the metadata array into memory. True
environment_variables Dict Environment variables to set for the object reader and embedding function. {}
**kwargs Keyword arguments to pass to the index constructor. {}

Methods

Name Description
query Queries the index and returns the nearest neighbors for each query object.
update_index Updates the index with new data.
update_object_reader Updates the object reader for the index.

query

vector_search.object_api.ObjectIndex.query(query_objects, k, query_metadata=None, metadata_array_cond=None, metadata_df_filter_fn=None, return_objects=True, return_metadata=True, driver_mode=Mode.REALTIME, driver_resource_class=None, driver_resources=None, extra_driver_modules=None, driver_access_credentials_name=None, merge_results_result_pos_as_score=True, merge_results_reverse_dist=None, merge_results_per_query_embedding_group_fn=max, merge_results_per_query_group_fn=operator.add, **kwargs)

Queries the index and returns the nearest neighbors for each query object.

The query objects can be any type of object that is supported by the object reader. For example, if the object reader is configured to read images, then the query objects should be images.

The k parameter specifies the number of nearest neighbors to return for each query object.
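To make the meaning of k concrete, here is a brute-force sketch in plain Python. It is a conceptual illustration only, not the search algorithm used by the index. It assumes Euclidean distance and toy 2-D vectors, and the name brute_force_knn is hypothetical.

```python
import math


def brute_force_knn(vectors, queries, k):
    """Return (distances, ids): for each query, the k closest vectors."""
    all_distances, all_ids = [], []
    for query in queries:
        # Distance from this query to every vector, paired with the vector's position.
        scored = sorted((math.dist(query, v), i) for i, v in enumerate(vectors))
        top = scored[:k]
        all_distances.append([d for d, _ in top])
        all_ids.append([i for _, i in top])
    return all_distances, all_ids


vectors = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
distances, ids = brute_force_knn(vectors, queries=[(0.9, 0.0)], k=2)
print(ids)  # [[1, 0]]: one row per query, k ids per row, nearest first
```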

The query_metadata parameter can be used to pass metadata for the query objects. This metadata will be passed to the embedding function, which can use it to generate embeddings for the query objects.

The metadata_array_cond parameter can be used to filter the results of the query based on the metadata that is stored in the metadata array. This parameter should be a string that contains a valid TileDB query condition. For example, the following query condition could be used to filter the results to only include objects that have a color attribute that is equal to “red”:

metadata_array_cond="color='red'"
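When the filter value comes from user input, the condition string can be assembled programmatically. The helper below is a minimal sketch that reproduces only the single-equality form shown above. The name equals_condition is hypothetical, and quote escaping is deliberately left out because it is not described here.

```python
def equals_condition(attribute: str, value: str) -> str:
    """Build an attribute='value' condition string for a string-valued attribute."""
    if "'" in value:
        # Escaping rules are not covered here, so refuse rather than guess.
        raise ValueError("value must not contain a single quote")
    return f"{attribute}='{value}'"


cond = equals_condition("color", "red")
print(cond)  # color='red'

# The string is then passed as-is, for example:
# index.query(query_objects, k=5, metadata_array_cond=cond)
```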

The metadata_df_filter_fn parameter can be used to filter the results of the query based on the metadata that is stored in the metadata array. This parameter should be a function that takes a pandas DataFrame as input and returns a pandas DataFrame as output. The input DataFrame will contain the metadata for all of the objects that match the query, and the output DataFrame should contain the metadata for the objects that should be returned to the user.
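A filter function of this shape might look like the sketch below. It assumes the metadata DataFrame has a color column, carried over from the earlier query condition example. The name keep_red is hypothetical.

```python
def keep_red(metadata_df):
    """Keep only the rows whose color column equals "red"."""
    return metadata_df[metadata_df["color"] == "red"]


# The function itself is passed, not its result, for example:
# index.query(query_objects, k=5, metadata_df_filter_fn=keep_red)
```

Because the function receives and returns a DataFrame, any pandas row selection can be used in the body.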

The return_objects parameter specifies whether to return the objects themselves, or just the object IDs. If this parameter is set to True, the query() method returns the objects instead of the object IDs.

The return_metadata parameter specifies whether to return the metadata for the objects. If this parameter is set to True, the query() method also returns the object metadata, along with the distances and the objects or object IDs.
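The two flags change the shape of the returned tuple, so callers need to unpack accordingly. The wrappers below are a sketch with hypothetical names. They take any object that exposes a query() method with the signature documented here, and they assume the tuple order given in the Returns section: distances first, then objects or object IDs, then metadata if requested.

```python
def query_ids_only(index, query_objects, k):
    """Return (distances, object_ids) without loading objects or metadata."""
    distances, object_ids = index.query(
        query_objects, k, return_objects=False, return_metadata=False
    )
    return distances, object_ids


def query_ids_with_metadata(index, query_objects, k):
    """Return (distances, object_ids, metadata); return_metadata adds a third element."""
    distances, object_ids, metadata = index.query(
        query_objects, k, return_objects=False, return_metadata=True
    )
    return distances, object_ids, metadata
```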

Parameters

Name Type Description Default
query_objects np.ndarray The query objects. required
k int The number of nearest neighbors to return for each query object. required
query_metadata Optional[OrderedDict] Metadata for the query objects. None
metadata_array_cond Optional[str] A TileDB query condition that can be used to filter the results of the query based on the metadata that is stored in the metadata array. None
metadata_df_filter_fn Optional[Callable] A function that takes a pandas DataFrame of object metadata and returns the filtered DataFrame. Used to filter the results of the query based on the metadata that is stored in the metadata array. None
return_objects bool Whether to return the objects themselves, or just the object IDs. True
return_metadata bool Whether to return the metadata for the objects. True
driver_mode Optional[Mode] If not None, the query will be executed in a TileDB Cloud taskgraph using the driver mode specified. Mode.REALTIME
driver_resource_class Optional[str] If driver_mode is REALTIME, the resource class (standard or large) to use for the driver execution. None
driver_resources Optional[Mapping[str, Any]] If driver_mode is BATCH, the resources to use for the driver execution. Example: {"cpu": "1", "memory": "4Gi"} None
extra_driver_modules Optional[List[str]] A list of extra Python modules to install on the driver node. None
driver_access_credentials_name Optional[str] If driver_mode is not None, the access credentials name to use for the driver execution. None
merge_results_result_pos_as_score bool Applies only when there are multiple query embeddings per query. If True, each result score is based on the position of the result for the query embedding. True
merge_results_reverse_dist Optional[bool] Applies only when there are multiple query embeddings per query. If True, each distance is replaced by its reciprocal (1 / dist). None
merge_results_per_query_embedding_group_fn Callable Applies only when there are multiple query embeddings per query. Group function used to group together object scores per query embedding (e.g. max or min). max
merge_results_per_query_group_fn Callable Applies only when there are multiple query embeddings per query. Group function used to group together object scores per query (e.g. add). This is applied after merge_results_per_query_embedding_group_fn. operator.add
**kwargs Keyword arguments to pass to the index query method. {}

Returns

Type Description
Union[Tuple[np.ndarray, OrderedDict, Dict], Tuple[np.ndarray, OrderedDict], Tuple[np.ndarray, np.ndarray, Dict], Tuple[np.ndarray, np.ndarray]] A tuple containing the distances, objects or object IDs, and optionally the object metadata.
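The merge_results_* parameters describe a two-step grouping that is easier to follow in code. The sketch below shows one plausible reading of those descriptions for a single query with several query embeddings. It is not the library's implementation. The position score formula (n - position) / n is an assumption, the sketch falls back to the reciprocal distance whenever position scoring is off, and all names are hypothetical.

```python
import operator


def merge_results(results_per_embedding, k, result_pos_as_score=True,
                  per_embedding_fn=max, per_query_fn=operator.add):
    """Merge the ranked results of several query embeddings into one ranking.

    results_per_embedding holds one list per query embedding; each list contains
    (distance, object_id) pairs ordered from nearest to farthest.
    """
    merged = {}
    for results in results_per_embedding:
        # Step 1: one score per object for this query embedding.
        per_embedding = {}
        n = len(results)
        for position, (dist, object_id) in enumerate(results):
            if result_pos_as_score:
                score = (n - position) / n  # assumed formula: earlier rank, higher score
            else:
                score = 1.0 / dist  # reciprocal distance; requires dist != 0
            if object_id in per_embedding:
                score = per_embedding_fn(per_embedding[object_id], score)
            per_embedding[object_id] = score
        # Step 2: combine with the scores from earlier query embeddings.
        for object_id, score in per_embedding.items():
            if object_id in merged:
                merged[object_id] = per_query_fn(merged[object_id], score)
            else:
                merged[object_id] = score
    # Highest combined score first.
    ranked = sorted(merged.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


embedding_a = [(0.1, "cat"), (0.2, "dog"), (0.3, "cat")]
embedding_b = [(0.05, "dog"), (0.4, "bird"), (0.5, "cat")]
top = merge_results([embedding_a, embedding_b], k=2)
print([object_id for object_id, _ in top])  # ['dog', 'cat']
```

In this toy input, "cat" appears twice for the first embedding, so max keeps its best score there. Addition then rewards "dog" for ranking well under both embeddings.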

update_index

vector_search.object_api.ObjectIndex.update_index(index_timestamp=None, workers=-1, worker_resources=None, worker_image=None, extra_worker_modules=None, driver_resources=None, driver_image=None, extra_driver_modules=None, worker_access_credentials_name=None, max_tasks_per_stage=-1, verbose=False, trace_id=None, embeddings_generation_mode=Mode.LOCAL, embeddings_generation_driver_mode=Mode.LOCAL, vector_indexing_mode=Mode.LOCAL, config=None, namespace=None, environment_variables={}, use_updates_array=True, **kwargs)

Updates the index with new data.

This is useful when the data that the index is built on has changed.

The update uses the ingest_embeddings_with_driver function to add embeddings to a TileDB Vector Search index.

This function orchestrates the embedding ingestion process by creating and executing a TileDB Cloud DAG (Directed Acyclic Graph). The DAG consists of two main stages:

  1. Embeddings Generation: This stage is responsible for computing embeddings for the objects to be indexed.

  2. Vector Indexing: This stage is responsible for ingesting the generated embeddings into the TileDB vector search index.

Both stages can be executed in one of three modes:

  • LOCAL: Embeddings are ingested locally within the current process.
  • REALTIME: Embeddings are ingested using a TileDB Cloud REALTIME TaskGraph.
  • BATCH: Embeddings are ingested using a TileDB Cloud BATCH TaskGraph.

The ingest_embeddings_with_driver function provides flexibility in configuring the execution environment for both stages, and it can also run the full execution within a driver UDF. Users can specify the number of workers, resources, Docker images, and extra modules for both the driver and worker nodes.
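As a sketch of how the mode and worker settings fit together, the hypothetical helper below runs both stages in the same mode. It accepts any object exposing update_index() with the signature documented here, and it uses only keyword names that appear in that signature. The mode argument is expected to be a member of the Mode enum. The resource dictionary in the usage comment assumes that worker_resources takes the same shape as the driver_resources example.

```python
def refresh_index(index, mode, workers=-1, worker_resources=None):
    """Re-run embeddings generation and vector indexing in the same execution mode."""
    index.update_index(
        embeddings_generation_mode=mode,
        vector_indexing_mode=mode,
        workers=workers,
        worker_resources=worker_resources,
    )


# Example call, assuming Mode has been imported and index is an open ObjectIndex:
# refresh_index(index, Mode.BATCH, workers=4,
#               worker_resources={"cpu": "1", "memory": "4Gi"})
```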

The update_index() method takes the following parameters:

Parameters

Name Type Description Default
index_timestamp int Timestamp to use for the update to take place at. None
workers int The number of workers to use for the update. If this parameter is not specified, then the default number of workers will be used. -1
worker_resources Dict The resources to use for each worker. None
worker_image str The Docker image to use for each worker. None
extra_worker_modules Optional[List[str]] Extra modules to install on the worker nodes. None
driver_resources Dict The resources to use for the driver. None
driver_image str The Docker image to use for the driver. None
extra_driver_modules Optional[List[str]] Extra modules to install on the driver node. None
worker_access_credentials_name str The name of the TileDB Cloud access credentials to use for the workers. None
max_tasks_per_stage int The maximum number of tasks to run per stage. -1
verbose bool Whether to print verbose output. False
trace_id Optional[str] The trace ID to use for the update. None
embeddings_generation_mode Mode The mode to use for generating embeddings. Mode.LOCAL
embeddings_generation_driver_mode Mode The mode to use for the driver of the embeddings generation task. Mode.LOCAL
vector_indexing_mode Mode The mode to use for indexing the vectors. Mode.LOCAL
config Optional[Mapping[str, Any]] TileDB config dictionary. None
namespace Optional[str] The TileDB Cloud namespace to use for the update. If this parameter is not specified, then the default namespace will be used. None
environment_variables Dict Environment variables to set for the object reader and embedding function. {}
**kwargs Keyword arguments to pass to the ingestion function. {}

update_object_reader

vector_search.object_api.ObjectIndex.update_object_reader(object_reader, config=None)

Updates the object reader for the index.

This is useful if the object reader needs to read objects from a different location or in a different format.

Parameters

Name Type Description Default
object_reader ObjectReader The new object reader. required
config Optional[Mapping[str, Any]] TileDB config dictionary. None