ingestion

vector_search.ingestion

Vector Search ingestion Utilities

This module contains the ingestion implementation for the different TileDB Vector Search algorithms.

It enables:

  • Local ingestion:
    • Multi-threaded execution that can leverage all the available local computing resources.
  • Distributed ingestion:
    • Execution across multiple workers in TileDB Cloud. This can be used to ingest large datasets and reduce ingestion latency.
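
The two modes can be sketched as follows. This is a minimal illustration, not the library's own code: the helper names plan_ingestion and run_ingestion are hypothetical, the URIs in the usage note are placeholders, and importing Mode from tiledb.cloud.dag is an assumption about where that enum lives. Planning is kept separate from the call so the arguments can be inspected without TileDB installed.

```python
from typing import Any, Dict, Optional, Tuple


def plan_ingestion(
    index_type: str,
    index_uri: str,
    source_uri: str,
    *,
    distributed: bool = False,
    namespace: Optional[str] = None,
    acn: Optional[str] = None,
    workers: int = -1,
) -> Tuple[str, Dict[str, Any]]:
    """Return the execution mode name and keyword arguments for ingest()."""
    kwargs: Dict[str, Any] = {
        "index_type": index_type,
        "index_uri": index_uri,
        "source_uri": source_uri,
    }
    if not distributed:
        # Local ingestion: multi-threaded on this machine, no cloud settings.
        return "LOCAL", kwargs
    # Distributed ingestion: runs on TileDB Cloud workers.
    kwargs["namespace"] = namespace
    kwargs["acn"] = acn
    kwargs["workers"] = workers  # -1 leaves the worker count auto-configured
    return "BATCH", kwargs


def run_ingestion(mode_name: str, kwargs: Dict[str, Any]):
    """Call ingest(); imports are deferred so planning works without TileDB."""
    from tiledb.cloud.dag import Mode  # assumed location of the Mode enum
    from tiledb.vector_search.ingestion import ingest

    return ingest(mode=getattr(Mode, mode_name), **kwargs)
```

For example, plan_ingestion("IVF_FLAT", "s3://bucket/index", "s3://bucket/vectors", distributed=True, namespace="my-namespace", workers=4) returns "BATCH" together with the arguments that run_ingestion would pass on.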

Functions

Name Description
ingest Ingest vectors into TileDB.

ingest

vector_search.ingestion.ingest(index_type, index_uri, *, input_vectors=None, source_uri=None, source_type=None, external_ids=None, external_ids_uri='', external_ids_type=None, updates_uri=None, index_timestamp=None, config=None, namespace=None, size=-1, partitions=-1, num_subspaces=-1, l_build=-1, r_max_degree=-1, training_sampling_policy=TrainingSamplingPolicy.FIRST_N, copy_centroids_uri=None, training_sample_size=-1, training_input_vectors=None, training_source_uri=None, training_source_type=None, workers=-1, input_vectors_per_work_item=-1, max_tasks_per_stage=-1, input_vectors_per_work_item_during_sampling=-1, max_sampling_tasks=-1, storage_version=STORAGE_VERSION, verbose=False, trace_id=None, use_sklearn=True, mode=Mode.LOCAL, acn=None, ingest_resources=None, consolidate_partition_resources=None, copy_centroids_resources=None, random_sample_resources=None, kmeans_resources=None, compute_new_centroids_resources=None, assign_points_and_partial_new_centroids_resources=None, write_centroids_resources=None, partial_index_resources=None, distance_metric=vspy.DistanceMetric.SUM_OF_SQUARES, normalized=False, **kwargs)

Ingest vectors into TileDB.
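
Several arguments in the signature apply to a single index family: partitions to IVF_FLAT and IVF_PQ, num_subspaces to IVF_PQ, and l_build and r_max_degree to VAMANA. A small client-side sketch of that grouping follows. INDEX_SPECIFIC_PARAMS and index_kwargs are hypothetical names, not part of the library, and rejecting mismatched arguments is a choice made here; the library itself may handle them differently.

```python
from typing import Any, Dict

# Hypothetical grouping, taken from the parameter descriptions for ingest().
INDEX_SPECIFIC_PARAMS = {
    "FLAT": (),
    "IVF_FLAT": ("partitions",),
    "IVF_PQ": ("partitions", "num_subspaces"),
    "VAMANA": ("l_build", "r_max_degree"),
}


def index_kwargs(index_type: str, **requested: int) -> Dict[str, Any]:
    """Keep only explicitly set knobs that apply to index_type."""
    if index_type not in INDEX_SPECIFIC_PARAMS:
        raise ValueError(f"unknown index_type: {index_type!r}")
    allowed = INDEX_SPECIFIC_PARAMS[index_type]
    unexpected = sorted(set(requested) - set(allowed))
    if unexpected:
        raise ValueError(f"{unexpected} do not apply to {index_type}")
    # -1 is the signature's default; drop it so it is not passed explicitly.
    return {name: value for name, value in requested.items() if value != -1}
```

The result can be splatted into a call, for instance ingest("VAMANA", index_uri, input_vectors=vectors, **index_kwargs("VAMANA", l_build=100, r_max_degree=64)), where index_uri and vectors are supplied by the caller.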

Parameters

Name Type Description Default
index_type str Type of vector index (FLAT, IVF_FLAT, IVF_PQ, VAMANA). required
index_uri str Vector index URI (stored as TileDB group). required
input_vectors Optional[np.ndarray] Input vectors. If provided, they take precedence over source_uri and source_type. None
source_uri Optional[str] Vectors source URI. None
source_type Optional[str] Type of the source vectors. If left empty, it is auto-detected. None
external_ids Optional[np.array] External IDs for the input vectors. If provided, they take precedence over external_ids_uri and external_ids_type. None
external_ids_uri Optional[str] Source URI for external_ids. ''
external_ids_type Optional[str] File type of external_ids_uri. If left empty, it is auto-detected. None
updates_uri Optional[str] Updates array URI. Used for consolidation of updates. None
index_timestamp Optional[int] Timestamp to use for writing and reading data. Defaults to the current Unix timestamp in milliseconds. None
config Optional[Mapping[str, Any]] TileDB config dictionary. None
namespace Optional[str] TileDB-Cloud namespace to use for Cloud execution. None
size int Number of input vectors to ingest. If not provided, the full input dataset is used. If provided, only the first size vectors from the input source are ingested. -1
partitions int For IVF_FLAT and IVF_PQ indexes, the number of partitions to generate from the data during k-means clustering. If not provided, it is auto-configured based on the dataset size. -1
num_subspaces int For IVF_PQ encoded indexes, the number of subspaces to use in the PQ encoding. We will divide the dimensions into num_subspaces parts, and PQ encode each part separately. This means dimensions must be divisible by num_subspaces. -1
l_build int For Vamana indexes, the number of neighbors considered for each node during construction of the graph. Larger values will take more time to build but result in indices that provide higher recall for the same search complexity. l_build should be >= r_max_degree unless you need to build indices quickly and can compromise on quality. Typically between 75 and 200. If not provided, use the default value of 100. -1
r_max_degree int For Vamana indexes, the maximum degree for each node in the final graph. Larger values will result in larger indices and longer indexing times, but better search quality. Typically between 60 and 150. If not provided, use the default value of 64. -1
copy_centroids_uri Optional[str] TileDB array URI to copy centroids from. If not provided, centroids are built by running k-means. None
training_sample_size int Sample size to use for computing k-means. If not provided, it is auto-configured based on the dataset size. Should not be provided if training_source_uri is provided. -1
training_input_vectors Optional[np.ndarray] Training input vectors. If provided, they take precedence over training_source_uri and training_source_type. Should not be provided if training_sample_size or training_source_uri is provided. None
training_source_uri Optional[str] The source URI to use for training centroids when building an IVF_FLAT vector index. If not provided, the first training_sample_size vectors from source_uri are used. Should not be provided if training_sample_size or training_input_vectors is provided. None
training_source_type Optional[str] Type of the training source data in training_source_uri. If left empty, it is auto-detected. Should only be provided when training_source_uri is provided. None
workers int Number of distributed workers to use for vector ingestion. If not provided, it is auto-configured based on the dataset size. -1
input_vectors_per_work_item int Number of vectors per ingestion work item. If not provided, it is auto-configured. -1
max_tasks_per_stage int Maximum number of tasks per execution stage of ingestion. If not provided, it is auto-configured. -1
input_vectors_per_work_item_during_sampling int Number of vectors per sample ingestion work item. If not provided, it is auto-configured. Only valid with training_sampling_policy=TrainingSamplingPolicy.RANDOM. -1
max_sampling_tasks int Maximum number of tasks per execution stage of sampling. If not provided, it is auto-configured. Only valid with training_sampling_policy=TrainingSamplingPolicy.RANDOM. -1
storage_version str Vector index storage format version. If not provided, defaults to the latest version. STORAGE_VERSION
verbose bool Enables verbose logging. False
trace_id Optional[str] Trace ID for logging. None
use_sklearn bool Whether to use scikit-learn’s implementation of k-means clustering instead of tiledb.vector_search’s. True
mode Mode Execution mode. Defaults to LOCAL; use BATCH for distributed execution. Mode.LOCAL
acn Optional[str] Access credential name to use for object store access when running in BATCH mode. None
ingest_resources Optional[Mapping[str, Any]] Resources to request when performing vector ingestion. Only applies to BATCH mode. None
consolidate_partition_resources Optional[Mapping[str, Any]] Resources to request when consolidating a partition. Only applies to BATCH mode. None
copy_centroids_resources Optional[Mapping[str, Any]] Resources to request when copying centroids from the input array to the output array. Only applies to BATCH mode. None
random_sample_resources Optional[Mapping[str, Any]] Resources to request when performing random sample selection. Only applies to BATCH mode. None
kmeans_resources Optional[Mapping[str, Any]] Resources to request when performing the k-means task. Only applies to BATCH mode. None
compute_new_centroids_resources Optional[Mapping[str, Any]] Resources to request when performing centroid computation. Only applies to BATCH mode. None
assign_points_and_partial_new_centroids_resources Optional[Mapping[str, Any]] Resources to request when computing partial centroids. Only applies to BATCH mode. None
write_centroids_resources Optional[Mapping[str, Any]] Resources to request when writing centroids. Only applies to BATCH mode. None
partial_index_resources Optional[Mapping[str, Any]] Resources to request when computing partial indexes. Only applies to BATCH mode. None
distance_metric vspy.DistanceMetric Distance metric to use for the index, defaults to ‘vspy.DistanceMetric.SUM_OF_SQUARES’. Options are ‘vspy.DistanceMetric.SUM_OF_SQUARES’, ‘vspy.DistanceMetric.INNER_PRODUCT’, ‘vspy.DistanceMetric.COSINE’, ‘vspy.DistanceMetric.L2’. vspy.DistanceMetric.SUM_OF_SQUARES
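
Some of the parameter descriptions above state constraints between arguments. The sketch below collects them into a client-side pre-check that could run before calling ingest. The function check_ingest_args is hypothetical and not part of the library, the dimensions argument is assumed to be known by the caller, and the library may enforce these rules differently or not at all.

```python
from typing import Any, List, Optional


def check_ingest_args(
    index_type: str,
    dimensions: int,
    *,
    num_subspaces: int = -1,
    l_build: int = -1,
    r_max_degree: int = -1,
    training_sample_size: int = -1,
    training_input_vectors: Optional[Any] = None,
    training_source_uri: Optional[str] = None,
    training_source_type: Optional[str] = None,
) -> List[str]:
    """Return a list of documented constraints that the arguments violate."""
    problems: List[str] = []

    # IVF_PQ: dimensions must be divisible by num_subspaces.
    if index_type == "IVF_PQ" and num_subspaces > 0:
        if dimensions % num_subspaces != 0:
            problems.append(
                f"dimensions ({dimensions}) is not divisible by "
                f"num_subspaces ({num_subspaces})"
            )

    # The three training inputs are documented as mutually exclusive.
    training_inputs = [
        training_sample_size != -1,
        training_input_vectors is not None,
        training_source_uri is not None,
    ]
    if sum(training_inputs) > 1:
        problems.append(
            "provide at most one of training_sample_size, "
            "training_input_vectors, training_source_uri"
        )

    # training_source_type only makes sense alongside training_source_uri.
    if training_source_type is not None and training_source_uri is None:
        problems.append("training_source_type requires training_source_uri")

    # VAMANA: l_build should be >= r_max_degree (a recommendation, not a
    # hard rule). Unset values fall back to the documented defaults.
    if index_type == "VAMANA":
        effective_l = 100 if l_build == -1 else l_build
        effective_r = 64 if r_max_degree == -1 else r_max_degree
        if effective_l < effective_r:
            problems.append("l_build should be >= r_max_degree")

    return problems
```

An empty list means none of these documented constraints is violated; for example, 128 dimensions with num_subspaces=16 passes, while 100 dimensions with num_subspaces=16 is reported.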