ingestion

vector_search.ingestion

Vector Search ingestion Utilities

This module contains the ingestion implementation for the different TileDB Vector Search algorithms.

It enables:

Local ingestion:
- Multi-threaded execution that can leverage all the available local computing resources.
Distributed ingestion:
- Distributed ingestion execution with multiple workers in TileDB Cloud. This can be used to ingest large datasets and reduce ingestion latency.
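The two execution styles above differ mainly in which keyword arguments are passed to `ingest`. The sketch below is a hedged illustration, not the library's own code: `ingest_kwargs` and `run_ingest` are hypothetical helpers, the URIs are placeholders, and the import locations (`tiledb.cloud.dag.Mode`, `tiledb.vector_search.ingestion.ingest`) are assumptions about where these names live.

```python
def ingest_kwargs(index_uri, source_uri, distributed=False, namespace=None, workers=-1):
    """Collect keyword arguments for ingest() for local or distributed execution.

    Hypothetical helper for illustration only.
    """
    kwargs = {
        "index_type": "IVF_FLAT",
        "index_uri": index_uri,
        "source_uri": source_uri,
    }
    if distributed:
        # Assumption: Mode is the execution-mode enum shipped with the
        # TileDB Cloud client. Imported lazily so the local path does not
        # require that package.
        from tiledb.cloud.dag import Mode

        kwargs["mode"] = Mode.BATCH
        kwargs["namespace"] = namespace
        # -1 leaves the number of workers to be auto-configured.
        kwargs["workers"] = workers
    # For local execution nothing else is needed: mode defaults to Mode.LOCAL.
    return kwargs


def run_ingest(**kwargs):
    """Forward the collected arguments to ingest()."""
    # Assumption: the module is importable under the tiledb namespace package.
    from tiledb.vector_search.ingestion import ingest

    return ingest(**kwargs)


# Placeholder URIs; replace them with real locations before running an ingestion.
local_kwargs = ingest_kwargs("file:///tmp/example_index", "file:///tmp/example_vectors")
```

Calling `run_ingest(**local_kwargs)` would then start a local ingestion, while `ingest_kwargs(..., distributed=True, namespace="my-namespace")` prepares the distributed variant.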

Functions

Name	Description
ingest	Ingest vectors into TileDB.

ingest

vector_search.ingestion.ingest(index_type, index_uri, *, input_vectors=None, source_uri=None, source_type=None, external_ids=None, external_ids_uri='', external_ids_type=None, updates_uri=None, index_timestamp=None, config=None, namespace=None, size=-1, partitions=-1, num_subspaces=-1, l_build=-1, r_max_degree=-1, training_sampling_policy=TrainingSamplingPolicy.FIRST_N, copy_centroids_uri=None, training_sample_size=-1, training_input_vectors=None, training_source_uri=None, training_source_type=None, workers=-1, input_vectors_per_work_item=-1, max_tasks_per_stage=-1, input_vectors_per_work_item_during_sampling=-1, max_sampling_tasks=-1, storage_version=STORAGE_VERSION, verbose=False, trace_id=None, use_sklearn=True, mode=Mode.LOCAL, acn=None, ingest_resources=None, consolidate_partition_resources=None, copy_centroids_resources=None, random_sample_resources=None, kmeans_resources=None, compute_new_centroids_resources=None, assign_points_and_partial_new_centroids_resources=None, write_centroids_resources=None, partial_index_resources=None, distance_metric=vspy.DistanceMetric.SUM_OF_SQUARES, normalized=False, **kwargs)

Ingest vectors into TileDB.
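As a minimal sketch of a local call with in-memory data, the example below generates random vectors and passes them through `input_vectors` and `external_ids`. The helper names, the vector shape, and the `uint64` ID type are illustrative assumptions, as is the import path `tiledb.vector_search.ingestion`.

```python
import numpy as np


def make_example_vectors(num_vectors=1000, dimensions=64, seed=0):
    """Generate random float32 vectors and one integer ID per vector."""
    rng = np.random.default_rng(seed)
    vectors = rng.random((num_vectors, dimensions), dtype=np.float32)
    # Assumption: external IDs are unsigned 64-bit integers.
    external_ids = np.arange(num_vectors, dtype=np.uint64)
    return vectors, external_ids


def ingest_local(index_uri, vectors, external_ids):
    """Build an IVF_FLAT index from in-memory vectors using the default local mode."""
    # Imported here so the data helper above works without the package installed.
    from tiledb.vector_search.ingestion import ingest

    # Whatever ingest() returns is passed through unchanged.
    return ingest(
        index_type="IVF_FLAT",
        index_uri=index_uri,
        input_vectors=vectors,
        external_ids=external_ids,
    )
```

A call such as `ingest_local("file:///tmp/example_index", *make_example_vectors())` would write the index group to the given placeholder URI.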

Parameters

Name	Type	Description	Default
`index_type`	str	Type of vector index (FLAT, IVF_FLAT, IVF_PQ, VAMANA).	required
`index_uri`	str	Vector index URI (stored as TileDB group).	required
`input_vectors`	Optional[np.ndarray]	Input vectors. If provided, they take precedence over `source_uri` and `source_type`.	`None`
`source_uri`	Optional[str]	Vectors source URI.	`None`
`source_type`	Optional[str]	Type of the source vectors. If left empty, it is auto-detected.	`None`
`external_ids`	Optional[np.array]	External IDs of the input vectors. If provided, they take precedence over `external_ids_uri` and `external_ids_type`.	`None`
`external_ids_uri`	Optional[str]	Source URI for `external_ids`.	`''`
`external_ids_type`	Optional[str]	File type of `external_ids_uri`. If left empty, it is auto-detected.	`None`
`updates_uri`	Optional[str]	Updates array URI. Used for consolidation of updates.	`None`
`index_timestamp`	Optional[int]	Timestamp to use for writing and reading data. By default, the current Unix timestamp in milliseconds is used.	`None`
`config`	Optional[Mapping[str, Any]]	TileDB config dictionary.	`None`
`namespace`	Optional[str]	TileDB-Cloud namespace to use for Cloud execution.	`None`
`size`	int	Number of input vectors to ingest. If not provided, the full input dataset is used. If provided, only the first `size` vectors from the input source are used.	`-1`
`partitions`	int	For IVF_FLAT and IVF_PQ indexes, the number of partitions to generate from the data during k-means clustering. If not provided, it is auto-configured based on the dataset size.	`-1`
`num_subspaces`	int	For IVF_PQ encoded indexes, the number of subspaces to use in the PQ encoding. The dimensions are divided into `num_subspaces` parts, and each part is PQ encoded separately. This means the number of dimensions must be divisible by `num_subspaces`.	`-1`
`l_build`	int	For Vamana indexes, the number of neighbors considered for each node during construction of the graph. Larger values take more time to build but result in indices that provide higher recall for the same search complexity. `l_build` should be >= `r_max_degree` unless you need to build indices quickly and can compromise on quality. Typically between 75 and 200. If not provided, the default value of 100 is used.	`-1`
`r_max_degree`	int	For Vamana indexes, the maximum degree for each node in the final graph. Larger values result in larger indices and longer indexing times, but better search quality. Typically between 60 and 150. If not provided, the default value of 64 is used.	`-1`
`copy_centroids_uri`	Optional[str]	TileDB array URI to copy centroids from. If not provided, centroids are built by running `k-means`.	`None`
`training_sample_size`	int	Sample size to use for computing `k-means`. If not provided, it is auto-configured based on the dataset size. Should not be provided if `training_source_uri` is provided.	`-1`
`training_input_vectors`	Optional[np.ndarray]	Training input vectors. If provided, they take precedence over `training_source_uri` and `training_source_type`. Should not be provided if `training_sample_size` or `training_source_uri` is provided.	`None`
`training_source_uri`	Optional[str]	The source URI to use for training centroids when building an `IVF_FLAT` vector index. If not provided, the first `training_sample_size` vectors from `source_uri` are used. Should not be provided if `training_sample_size` or `training_input_vectors` is provided.	`None`
`training_source_type`	Optional[str]	Type of the training source data in `training_source_uri`. If left empty, it is auto-detected. Should only be provided when `training_source_uri` is provided.	`None`
`workers`	int	Number of distributed workers to use for vector ingestion. If not provided, it is auto-configured based on the dataset size.	`-1`
`input_vectors_per_work_item`	int	Number of vectors per ingestion work item. If not provided, it is auto-configured.	`-1`
`max_tasks_per_stage`	int	Maximum number of tasks per execution stage of ingestion. If not provided, it is auto-configured.	`-1`
`input_vectors_per_work_item_during_sampling`	int	Number of vectors per sample ingestion work item. If not provided, it is auto-configured. Only valid with `training_sampling_policy=TrainingSamplingPolicy.RANDOM`.	`-1`
`max_sampling_tasks`	int	Maximum number of tasks per execution stage of sampling. If not provided, it is auto-configured. Only valid with `training_sampling_policy=TrainingSamplingPolicy.RANDOM`.	`-1`
`storage_version`	str	Vector index storage format version. If not provided, defaults to the latest version.	`STORAGE_VERSION`
`verbose`	bool	Enables verbose logging.	`False`
`trace_id`	Optional[str]	Trace ID for logging.	`None`
`use_sklearn`	bool	Whether to use scikit-learn’s implementation of k-means clustering instead of tiledb.vector_search’s.	`True`
`mode`	Mode	Execution mode. Defaults to `LOCAL`; use `BATCH` for distributed execution.	`Mode.LOCAL`
`acn`	Optional[str]	Access credential name to use for object store access when running in `BATCH` mode.	`None`
`ingest_resources`	Optional[Mapping[str, Any]]	Resources to request when performing vector ingestion. Only applies to `BATCH` mode.	`None`
`consolidate_partition_resources`	Optional[Mapping[str, Any]]	Resources to request when consolidating a partition. Only applies to `BATCH` mode.	`None`
`copy_centroids_resources`	Optional[Mapping[str, Any]]	Resources to request when copying centroids from the input array to the output array. Only applies to `BATCH` mode.	`None`
`random_sample_resources`	Optional[Mapping[str, Any]]	Resources to request when performing random sample selection. Only applies to `BATCH` mode.	`None`
`kmeans_resources`	Optional[Mapping[str, Any]]	Resources to request when performing the k-means task. Only applies to `BATCH` mode.	`None`
`compute_new_centroids_resources`	Optional[Mapping[str, Any]]	Resources to request when performing centroid computation. Only applies to `BATCH` mode.	`None`
`assign_points_and_partial_new_centroids_resources`	Optional[Mapping[str, Any]]	Resources to request when computing partial centroids. Only applies to `BATCH` mode.	`None`
`write_centroids_resources`	Optional[Mapping[str, Any]]	Resources to request when writing centroids. Only applies to `BATCH` mode.	`None`
`partial_index_resources`	Optional[Mapping[str, Any]]	Resources to request when computing partial indexes. Only applies to `BATCH` mode.	`None`
`distance_metric`	vspy.DistanceMetric	Distance metric to use for the index. Defaults to `vspy.DistanceMetric.SUM_OF_SQUARES`. Options are `vspy.DistanceMetric.SUM_OF_SQUARES`, `vspy.DistanceMetric.INNER_PRODUCT`, `vspy.DistanceMetric.COSINE`, `vspy.DistanceMetric.L2`.	`vspy.DistanceMetric.SUM_OF_SQUARES`
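Several of the parameters above carry constraints on how they may be combined. The following sketch gathers those stated in the table into a pre-flight check. `check_ingest_args` is a hypothetical helper and is not part of the library; whether `ingest` itself enforces the same rules, and how it reports violations, is not covered here. The `dimensions` argument (the vector dimensionality) is likewise an assumption introduced for the example.

```python
def check_ingest_args(
    index_type,
    dimensions,
    num_subspaces=-1,
    l_build=-1,
    r_max_degree=-1,
    training_sample_size=-1,
    training_input_vectors=None,
    training_source_uri=None,
    training_source_type=None,
):
    """Return a list of warnings; an empty list means no conflict was found."""
    warnings = []

    # IVF_PQ: the dimensions are split evenly across the subspaces.
    if index_type == "IVF_PQ" and num_subspaces > 0:
        if dimensions % num_subspaces != 0:
            warnings.append("dimensions must be divisible by num_subspaces")

    # Vamana: -1 selects the documented defaults (100 and 64).
    if index_type == "VAMANA":
        effective_l_build = 100 if l_build == -1 else l_build
        effective_r_max_degree = 64 if r_max_degree == -1 else r_max_degree
        if effective_l_build < effective_r_max_degree:
            warnings.append("l_build should be >= r_max_degree")

    # The three training inputs are documented as mutually exclusive.
    provided = [
        name
        for name, given in (
            ("training_sample_size", training_sample_size != -1),
            ("training_input_vectors", training_input_vectors is not None),
            ("training_source_uri", training_source_uri is not None),
        )
        if given
    ]
    if len(provided) > 1:
        warnings.append("provide only one of: " + ", ".join(provided))

    # training_source_type only makes sense together with training_source_uri.
    if training_source_type is not None and training_source_uri is None:
        warnings.append("training_source_type requires training_source_uri")

    return warnings
```

For instance, `check_ingest_args("IVF_PQ", 100, num_subspaces=16)` reports the divisibility problem, because 100 is not a multiple of 16, while `check_ingest_args("IVF_PQ", 128, num_subspaces=16)` returns an empty list.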