ingestion

vector_search.ingestion

Vector Search Ingestion Utilities

This module contains the ingestion implementations for the different TileDB Vector Search algorithms.

It enables:

  • Local ingestion:
    • Multi-threaded execution that can leverage all the available local computing resources.
  • Distributed ingestion:
    • Execution across multiple workers in TileDB Cloud. This can be used to ingest large datasets and to reduce ingestion time.
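As a rough sketch of how the two paths differ from the caller's side, the helper below collects keyword arguments for the ingest function documented on this page. The helper names build_ingest_kwargs and run_ingest, the example URIs, and the import path tiledb.vector_search.ingestion are assumptions made for illustration. The mode value is expected to be the BATCH member of the Mode enum that appears in the signature, and it is supplied by the caller rather than imported here.

```python
from typing import Any, Dict, Optional

# Index types listed in the parameter table for index_type.
VALID_INDEX_TYPES = ("FLAT", "IVF_FLAT", "VAMANA")


def build_ingest_kwargs(
    index_type: str,
    index_uri: str,
    source_uri: str,
    *,
    mode: Optional[Any] = None,
    namespace: Optional[str] = None,
    workers: int = -1,
) -> Dict[str, Any]:
    """Hypothetical helper: collect keyword arguments for ingest()."""
    if index_type not in VALID_INDEX_TYPES:
        raise ValueError(f"index_type must be one of {VALID_INDEX_TYPES}")
    kwargs: Dict[str, Any] = {
        "index_type": index_type,
        "index_uri": index_uri,
        "source_uri": source_uri,
    }
    if mode is not None:
        # Distributed path: the caller passes the BATCH member of Mode,
        # plus the TileDB Cloud namespace and the number of workers.
        kwargs.update(mode=mode, namespace=namespace, workers=workers)
    # Local path: mode is omitted, so ingest() falls back to its default.
    return kwargs


def run_ingest(kwargs: Dict[str, Any]) -> Any:
    """Call ingest(); the import path is an assumption and is resolved lazily."""
    from tiledb.vector_search.ingestion import ingest

    return ingest(**kwargs)


# Example URIs are placeholders, not real datasets.
local_kwargs = build_ingest_kwargs(
    "IVF_FLAT", "/tmp/example_index", "/tmp/example_vectors"
)
```

Keeping argument construction separate from the call makes it possible to inspect or log the arguments before any work starts; run_ingest(local_kwargs) would then perform the ingestion.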

Functions

Name Description
ingest Ingest vectors into TileDB.

ingest

vector_search.ingestion.ingest(index_type, index_uri, *, input_vectors=None, source_uri=None, source_type=None, external_ids=None, external_ids_uri='', external_ids_type=None, updates_uri=None, index_timestamp=None, config=None, namespace=None, size=-1, partitions=-1, training_sampling_policy=TrainingSamplingPolicy.FIRST_N, copy_centroids_uri=None, training_sample_size=-1, training_input_vectors=None, training_source_uri=None, training_source_type=None, workers=-1, input_vectors_per_work_item=-1, max_tasks_per_stage=-1, input_vectors_per_work_item_during_sampling=-1, max_sampling_tasks=-1, storage_version=STORAGE_VERSION, verbose=False, trace_id=None, use_sklearn=True, mode=Mode.LOCAL, acn=None, ingest_resources=None, consolidate_partition_resources=None, copy_centroids_resources=None, random_sample_resources=None, kmeans_resources=None, compute_new_centroids_resources=None, assign_points_and_partial_new_centroids_resources=None, write_centroids_resources=None, partial_index_resources=None, **kwargs)

Ingest vectors into TileDB.

Parameters

Name Type Description Default
index_type str Type of vector index (FLAT, IVF_FLAT, VAMANA). required
index_uri str Vector index URI (stored as a TileDB group). required
input_vectors np.ndarray Input vectors. If provided, they take precedence over source_uri and source_type. None
source_uri str Vectors source URI. None
source_type str Type of the source vectors. If left empty, it is auto-detected. None
external_ids np.array External IDs of the input vectors. If provided, they take precedence over external_ids_uri and external_ids_type. None
external_ids_uri str Source URI for external_ids. ''
external_ids_type str File type of external_ids_uri. If left empty, it is auto-detected. None
updates_uri str Updates array URI. Used for consolidation of updates. None
index_timestamp int Timestamp to use for writing and reading data. By default, the current Unix timestamp in milliseconds is used. None
config Optional[Mapping[str, Any]] TileDB config dictionary. None
namespace Optional[str] TileDB-Cloud namespace to use for Cloud execution. None
size int Number of input vectors. If not provided, the full size of the input dataset is used. If provided, only that many vectors, taken from the start of the input source, are ingested. -1
partitions int Number of partitions to load the data with. If not provided, it is auto-configured based on the dataset size. -1
copy_centroids_uri str TileDB array URI to copy centroids from. If not provided, centroids are built by running k-means. None
training_sample_size int Sample size to use for computing k-means. If not provided, it is auto-configured based on the dataset size. Should not be provided if training_source_uri is provided. -1
training_input_vectors np.ndarray Training input vectors. If provided, they take precedence over training_source_uri and training_source_type. Should not be provided if training_sample_size or training_source_uri is provided. None
training_source_uri str The source URI to use for training centroids when building an IVF_FLAT vector index. If not provided, the first training_sample_size vectors from source_uri are used. Should not be provided if training_sample_size or training_input_vectors is provided. None
training_source_type str Type of the training source data in training_source_uri. If left empty, it is auto-detected. Should only be provided when training_source_uri is provided. None
workers int Number of distributed workers to use for vector ingestion. If not provided, it is auto-configured based on the dataset size. -1
input_vectors_per_work_item int Number of vectors per ingestion work item. If not provided, it is auto-configured. -1
max_tasks_per_stage int Max number of tasks per execution stage of ingestion. If not provided, it is auto-configured. -1
input_vectors_per_work_item_during_sampling int Number of vectors per sample ingestion work item. If not provided, it is auto-configured. Only valid with training_sampling_policy=TrainingSamplingPolicy.RANDOM. -1
max_sampling_tasks int Max number of tasks per execution stage of sampling. If not provided, it is auto-configured. Only valid with training_sampling_policy=TrainingSamplingPolicy.RANDOM. -1
storage_version str Vector index storage format version. If not provided, defaults to the latest version. STORAGE_VERSION
verbose bool Enables verbose logging. False
trace_id Optional[str] Trace ID for logging. None
use_sklearn bool Whether to use scikit-learn’s implementation of k-means clustering instead of tiledb.vector_search’s. True
mode Mode Execution mode. Defaults to LOCAL; use BATCH for distributed execution. Mode.LOCAL
acn Optional[str] Access credential name to use for object store access when running in BATCH mode. None
ingest_resources Optional[Mapping[str, Any]] Resources to request when performing vector ingestion. Only applies to BATCH mode. None
consolidate_partition_resources Optional[Mapping[str, Any]] Resources to request when performing consolidation of a partition. Only applies to BATCH mode. None
copy_centroids_resources Optional[Mapping[str, Any]] Resources to request when copying centroids from the input array to the output array. Only applies to BATCH mode. None
random_sample_resources Optional[Mapping[str, Any]] Resources to request when performing random sample selection. Only applies to BATCH mode. None
kmeans_resources Optional[Mapping[str, Any]] Resources to request when performing the k-means task. Only applies to BATCH mode. None
compute_new_centroids_resources Optional[Mapping[str, Any]] Resources to request when performing centroid computation. Only applies to BATCH mode. None
assign_points_and_partial_new_centroids_resources Optional[Mapping[str, Any]] Resources to request when performing the computation of partial centroids. Only applies to BATCH mode. None
write_centroids_resources Optional[Mapping[str, Any]] Resources to request when performing the write of centroids. Only applies to BATCH mode. None
partial_index_resources Optional[Mapping[str, Any]] Resources to request when performing the computation of partial indexing. Only applies to BATCH mode. None
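
Several parameters in the table above constrain one another: the three training inputs are mutually exclusive, training_source_type depends on training_source_uri, the sampling work-item settings apply only to the RANDOM sampling policy, and acn and the *_resources mappings are described for BATCH mode only. The sketch below restates those rules as a pre-flight check. The function check_ingest_arguments is hypothetical and not part of the library; the booleans random_sampling and batch_mode stand in for the TrainingSamplingPolicy and Mode enum members, and resources stands in for any one of the *_resources mappings. Whether ingest itself rejects or silently ignores such combinations is not stated in this reference.

```python
from typing import Any, List, Mapping, Optional


def check_ingest_arguments(
    *,
    training_sample_size: int = -1,
    training_input_vectors: Optional[Any] = None,
    training_source_uri: Optional[str] = None,
    training_source_type: Optional[str] = None,
    random_sampling: bool = False,
    input_vectors_per_work_item_during_sampling: int = -1,
    max_sampling_tasks: int = -1,
    batch_mode: bool = False,
    acn: Optional[str] = None,
    resources: Optional[Mapping[str, Any]] = None,
) -> List[str]:
    """Hypothetical check: return a list of conflicts, empty if none were found."""
    problems: List[str] = []

    # At most one of the three training inputs should be supplied.
    training_inputs = {
        "training_sample_size": training_sample_size != -1,
        "training_input_vectors": training_input_vectors is not None,
        "training_source_uri": training_source_uri is not None,
    }
    provided = [name for name, given in training_inputs.items() if given]
    if len(provided) > 1:
        problems.append("conflicting training inputs: " + ", ".join(provided))

    # training_source_type only makes sense together with training_source_uri.
    if training_source_type is not None and training_source_uri is None:
        problems.append("training_source_type requires training_source_uri")

    # Sampling work-item settings are documented as valid only for RANDOM.
    sampling_set = (
        input_vectors_per_work_item_during_sampling != -1
        or max_sampling_tasks != -1
    )
    if sampling_set and not random_sampling:
        problems.append("sampling settings require the RANDOM sampling policy")

    # acn and the *_resources mappings are documented for BATCH mode.
    if not batch_mode and (acn is not None or resources):
        problems.append("acn and *_resources only apply to BATCH mode")

    return problems
```

An empty list means none of the documented constraints were violated; for instance, passing both training_sample_size and training_source_uri yields a single conflict message.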