object_api.ObjectIndex

vector_search.object_api.ObjectIndex(self, uri, config=None, timestamp=None, open_for_remote_query_execution=False, open_vector_index_for_remote_query_execution=False, load_embedding=True, load_metadata_in_memory=True, environment_variables={}, **kwargs)

An ObjectIndex represents a TileDB Vector Search index that is associated with a user-defined object reader and embedding function. This allows users to easily create and query TileDB Vector Search indexes that are backed by arbitrary data.

For example, an ObjectIndex can be used to create a TileDB Vector Search index that is backed by a collection of images. The object reader would be responsible for loading the images from disk, and the embedding function would be responsible for generating embeddings for the images.

Once the ObjectIndex is created, it can be queried using the query() method. The query() method takes a list of query objects and returns a list of the nearest neighbors for each query object.

The ObjectIndex class also provides methods for updating the index (update_index()) and updating the object reader (update_object_reader()).

Parameters

Name	Type	Description	Default
`uri`	str	The URI of the index.	required
`config`	Optional[Mapping[str, Any]]	TileDB config dictionary.	`None`
`timestamp`		Timestamp to open the index at.	`None`
`load_embedding`	bool	Whether to load the embedding function into memory.	`True`
`open_for_remote_query_execution`	bool	If `True`, do not load the embedding model or any index data locally, and instead perform all query functionality in a TileDB Cloud taskgraph.	`False`
`open_vector_index_for_remote_query_execution`	bool	If `True`, do not load any index data in main memory locally, and instead load index data and perform vector queries in a TileDB Cloud taskgraph. Compared to `open_for_remote_query_execution`, this loads the object embedding function and computes query object embeddings locally.	`False`
`load_metadata_in_memory`	bool	Whether to load the metadata array into memory.	`True`
`environment_variables`	Dict	Environment variables to set for the object reader and embedding function.	`{}`
`**kwargs`		Keyword arguments to pass to the index constructor.	`{}`

Methods

Name	Description
query	Queries the index and returns the nearest neighbors for each query object.
update_index	Updates the index with new data.
update_object_reader	Updates the object reader for the index.

query

vector_search.object_api.ObjectIndex.query(query_objects, k, query_metadata=None, metadata_array_cond=None, metadata_df_filter_fn=None, return_objects=True, return_metadata=True, driver_mode=Mode.REALTIME, driver_resource_class=None, driver_resources=None, extra_driver_modules=None, driver_access_credentials_name=None, merge_results_result_pos_as_score=True, merge_results_reverse_dist=None, merge_results_per_query_embedding_group_fn=max, merge_results_per_query_group_fn=operator.add, **kwargs)

Queries the index and returns the nearest neighbors for each query object.

The query objects can be any type of object that is supported by the object reader. For example, if the object reader is configured to read images, then the query objects should be images.

The k parameter specifies the number of nearest neighbors to return for each query object.

The query_metadata parameter can be used to pass metadata for the query objects. This metadata will be passed to the embedding function, which can use it to generate embeddings for the query objects.

The metadata_array_cond parameter can be used to filter the results of the query based on the metadata that is stored in the metadata array. This parameter should be a string that contains a valid TileDB query condition. For example, the following query condition could be used to filter the results to only include objects that have a color attribute that is equal to “red”:

metadata_array_cond="color == 'red'"

The metadata_df_filter_fn parameter can be used to filter the results of the query based on the metadata that is stored in the metadata array. This parameter should be a function that takes a pandas DataFrame as input and returns a pandas DataFrame as output. The input DataFrame will contain the metadata for all of the objects that match the query, and the output DataFrame should contain the metadata for the objects that should be returned to the user.
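As a minimal sketch of such a filter function, the example below keeps only the rows whose `color` value is "red". The function name `keep_red_objects` and the `color` column are illustrative assumptions borrowed from the query-condition example above; the actual columns depend on the metadata stored with the index.

```python
def keep_red_objects(metadata_df):
    """Return only the rows whose (hypothetical) `color` column is "red".

    `metadata_df` is expected to be a pandas DataFrame; the result is a
    new DataFrame, as the filter contract described above requires.
    """
    # Boolean-mask indexing selects the matching rows.
    return metadata_df[metadata_df["color"] == "red"]


# Assumed usage, given an already opened ObjectIndex named `index`:
# index.query(query_objects, k=5, metadata_df_filter_fn=keep_red_objects)
```

A DataFrame-based filter of this kind can express conditions that are awkward to write as a query condition string, at the cost of running after the metadata has been loaded into the DataFrame.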

The return_objects parameter specifies whether to return the objects themselves, or just the object IDs. If this parameter is set to True, then the query() method will return the objects instead of the object IDs.

The return_metadata parameter specifies whether to return the metadata for the objects. If this parameter is set to True, then the query() method will also return the object metadata along with the distances and object IDs.

Parameters

Name	Type	Description	Default
`query_objects`	np.ndarray	The query objects.	required
`k`	int	The number of nearest neighbors to return for each query object.	required
`query_metadata`	Optional[OrderedDict]	Metadata for the query objects.	`None`
`metadata_array_cond`	Optional[str]	A TileDB query condition that can be used to filter the results of the query based on the metadata that is stored in the metadata array.	`None`
`metadata_df_filter_fn`	Optional[Callable]	A function that takes a pandas DataFrame of metadata and returns a filtered pandas DataFrame, used to filter the results of the query based on the metadata that is stored in the metadata array.	`None`
`return_objects`	bool	Whether to return the objects themselves, or just the object IDs.	`True`
`return_metadata`	bool	Whether to return the metadata for the objects.	`True`
`driver_mode`	Optional[Mode]	If not `None`, the query will be executed in a TileDB Cloud taskgraph using the driver mode specified.	`Mode.REALTIME`
`driver_resource_class`	Optional[str]	If `driver_mode` is `REALTIME`, the resource class (`standard` or `large`) to use for the driver execution.	`None`
`driver_resources`	Optional[Mapping[str, Any]]	If `driver_mode` is `BATCH`, the resources to use for the driver execution. Example: `{"cpu": "1", "memory": "4Gi"}`	`None`
`extra_driver_modules`	Optional[List[str]]	A list of extra Python modules to install on the driver node.	`None`
`driver_access_credentials_name`	Optional[str]	If `driver_mode` is not `None`, the access credentials name to use for the driver execution.	`None`
`merge_results_result_pos_as_score`	bool	Applies only when there are multiple query embeddings per query. If True, each result score is based on the position of the result for the query embedding.	`True`
`merge_results_reverse_dist`	Optional[bool]	Applies only when there are multiple query embeddings per query. If `True`, the distances are inverted by taking their reciprocal (`1 / dist`).	`None`
`merge_results_per_query_embedding_group_fn`	Callable	Applies only when there are multiple query embeddings per query. Group function used to group together object scores per query embedding (e.g. `max`, `min`).	`max`
`merge_results_per_query_group_fn`	Callable	Applies only when there are multiple query embeddings per query. Group function used to group together object scores per query (e.g. `operator.add`). This is applied after `merge_results_per_query_embedding_group_fn`.	`operator.add`
`**kwargs`		Keyword arguments to pass to the index query method.	`{}`
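To make the two-level grouping of the `merge_results_*` parameters concrete, the following is an illustrative sketch in plain Python, not the library's implementation. The function name `merge_results`, the input shape (one ranked list of object IDs per query embedding), and the position score (list length minus position) are all assumptions made for the example. It also assumes the same object can appear more than once in a single result list.

```python
import operator
from collections import defaultdict


def merge_results(results_per_embedding, k,
                  per_embedding_group_fn=max,
                  per_query_group_fn=operator.add):
    """Merge ranked results from several query embeddings of one query.

    results_per_embedding: list of ranked lists of object IDs, best first.
    Returns up to k (object_id, score) pairs, highest score first.
    """
    merged = {}
    for ranked_ids in results_per_embedding:
        # Position as score (assumed formula): earlier results score higher.
        scores_by_object = defaultdict(list)
        for pos, object_id in enumerate(ranked_ids):
            scores_by_object[object_id].append(len(ranked_ids) - pos)
        for object_id, scores in scores_by_object.items():
            # First level: collapse repeated hits within one query embedding.
            score = per_embedding_group_fn(scores)
            # Second level: combine scores across the query's embeddings.
            if object_id in merged:
                merged[object_id] = per_query_group_fn(merged[object_id], score)
            else:
                merged[object_id] = score
    ranked = sorted(merged.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


top = merge_results([["a", "b", "a", "c"], ["b", "c", "d", "a"]], k=3)
# top == [("b", 7), ("a", 5), ("c", 4)]
```

With the defaults, an object that ranks well for several query embeddings accumulates a higher total than one that ranks first for a single embedding only.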

Returns

Type	Description
Union[Tuple[np.ndarray, OrderedDict, Dict], Tuple[np.ndarray, OrderedDict], Tuple[np.ndarray, np.ndarray, Dict], Tuple[np.ndarray, np.ndarray]]	A tuple containing the distances, objects or object IDs, and optionally the object metadata.
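Because the length and contents of the returned tuple depend on `return_objects` and `return_metadata`, a small wrapper can make the unpacking explicit. The sketch below assumes the tuple order stated above (distances, then objects or object IDs, then metadata). `run_query` is a hypothetical helper, and `FakeIndex` is a stand-in used only so the snippet runs without a real index.

```python
def run_query(index, query_objects, k, return_objects=True, return_metadata=True):
    """Call index.query() and label the parts of the returned tuple."""
    result = index.query(
        query_objects,
        k,
        return_objects=return_objects,
        return_metadata=return_metadata,
    )
    if return_metadata:
        # Three elements: distances, objects or object IDs, metadata.
        distances, objects_or_ids, metadata = result
    else:
        # Two elements: distances, objects or object IDs.
        distances, objects_or_ids = result
        metadata = None
    key = "objects" if return_objects else "object_ids"
    return {"distances": distances, key: objects_or_ids, "metadata": metadata}


class FakeIndex:
    """Stand-in for an ObjectIndex; returns dummy data in the documented order."""

    def query(self, query_objects, k, return_objects=True,
              return_metadata=True, **kwargs):
        distances = [[0.0] * k for _ in query_objects]
        ids = [[0] * k for _ in query_objects]
        payload = {"image": ids} if return_objects else ids
        if return_metadata:
            metadata = {"color": [["red"] * k for _ in query_objects]}
            return distances, payload, metadata
        return distances, payload


labeled = run_query(FakeIndex(), ["query-1"], k=2, return_objects=False)
```

With a real index, the same wrapper would be called with an opened ObjectIndex in place of `FakeIndex()`.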

update_index

vector_search.object_api.ObjectIndex.update_index(index_timestamp=None, workers=-1, worker_resources=None, worker_image=None, extra_worker_modules=None, driver_resources=None, driver_image=None, extra_driver_modules=None, worker_access_credentials_name=None, max_tasks_per_stage=-1, verbose=False, trace_id=None, embeddings_generation_mode=Mode.LOCAL, embeddings_generation_driver_mode=Mode.LOCAL, vector_indexing_mode=Mode.LOCAL, config=None, namespace=None, environment_variables={}, use_updates_array=True, **kwargs)

Updates the index with new data.

This is useful when the data that the index is built on has changed.

update_index() uses the ingest_embeddings_with_driver function to add embeddings to a TileDB Vector Search index.

This function orchestrates the embedding ingestion process by creating and executing a TileDB Cloud DAG (Directed Acyclic Graph). The DAG consists of two main stages:

Embeddings Generation: This stage is responsible for computing embeddings for the objects to be indexed.
Vector Indexing: This stage is responsible for ingesting the generated embeddings into the TileDB vector search index.

Both stages can be executed in one of three modes:

LOCAL: Embeddings are ingested locally within the current process.
REALTIME: Embeddings are ingested using a TileDB Cloud REALTIME TaskGraph.
BATCH: Embeddings are ingested using a TileDB Cloud BATCH TaskGraph.

The ingest_embeddings_with_driver function lets users configure the execution environment for both stages, and it can also run the full execution within a driver UDF. Users can specify the number of workers, resources, Docker images, and extra modules for both the driver and worker nodes.

The update_index() method takes the following parameters:

Parameters

Name	Type	Description	Default
`index_timestamp`	int	Timestamp to use for the update to take place at.	`None`
`workers`	int	The number of workers to use for the update. If this parameter is not specified, then the default number of workers will be used.	`-1`
`worker_resources`	Dict	The resources to use for each worker.	`None`
`worker_image`	str	The Docker image to use for each worker.	`None`
`extra_worker_modules`	Optional[List[str]]	Extra modules to install on the worker nodes.	`None`
`driver_resources`	Dict	The resources to use for the driver.	`None`
`driver_image`	str	The Docker image to use for the driver.	`None`
`extra_driver_modules`	Optional[List[str]]	Extra modules to install on the driver node.	`None`
`worker_access_credentials_name`	str	The name of the TileDB Cloud access credentials to use for the workers.	`None`
`max_tasks_per_stage`	int	The maximum number of tasks to run per stage.	`-1`
`verbose`	bool	Whether to print verbose output.	`False`
`trace_id`	Optional[str]	The trace ID to use for the update.	`None`
`embeddings_generation_mode`	Mode	The mode to use for generating embeddings.	`Mode.LOCAL`
`embeddings_generation_driver_mode`	Mode	The mode to use for the driver of the embeddings generation task.	`Mode.LOCAL`
`vector_indexing_mode`	Mode	The mode to use for indexing the vectors.	`Mode.LOCAL`
`config`	Optional[Mapping[str, Any]]	TileDB config dictionary.	`None`
`namespace`	Optional[str]	The TileDB Cloud namespace to use for the update. If this parameter is not specified, then the default namespace will be used.	`None`
`environment_variables`	Dict	Environment variables to set for the object reader and embedding function.	`{}`
`**kwargs`		Keyword arguments to pass to the ingestion function.	`{}`
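As a hedged usage sketch, the helper below runs both stages of an update in the same mode by setting `embeddings_generation_mode` and `vector_indexing_mode` together. The helper name `update_index_with_mode` is hypothetical, and the `Mode` value is passed in by the caller so that the sketch does not have to assume an import path for it.

```python
def update_index_with_mode(index, mode, workers=-1, verbose=False):
    """Run both update stages (embeddings generation, vector indexing) in `mode`.

    `index` is expected to be an opened ObjectIndex and `mode` one of the
    Mode values listed above (LOCAL, REALTIME or BATCH).
    """
    index.update_index(
        embeddings_generation_mode=mode,
        vector_indexing_mode=mode,
        workers=workers,  # -1 keeps the default number of workers
        verbose=verbose,
    )


# Assumed usage, given an opened ObjectIndex `index` and the library's Mode enum:
# update_index_with_mode(index, Mode.BATCH, workers=4, verbose=True)
```

Keeping the two stage modes in one argument avoids mixed configurations by accident; the stages can still be set independently by calling update_index() directly.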

update_object_reader

vector_search.object_api.ObjectIndex.update_object_reader(object_reader, config=None)

Updates the object reader for the index.

This is useful if the object reader needs to read objects from a different location or in a different format.

Parameters

Name	Type	Description	Default
`object_reader`	ObjectReader	The new object reader.	required
`config`	Optional[Mapping[str, Any]]	TileDB config dictionary.	`None`