ingestion
vector_search.ingestion
Vector Search ingestion Utilities
This module contains the ingestion implementation for the different TileDB Vector Search algorithms.
It enables:
- Local ingestion:
  - Multi-threaded execution that can use all available local computing resources.
- Distributed ingestion:
  - Execution across multiple workers in TileDB Cloud. This can be used to ingest large datasets and reduce ingestion time.
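As a rough illustration of how the two modes differ from the caller's side, the sketch below collects the `ingest` keyword arguments that select an execution mode. The helper name `execution_options` and its `batch_mode` argument are hypothetical. The caller is assumed to pass the library's own `Mode.BATCH` value, since this page does not state where `Mode` is imported from.

```python
def execution_options(distributed, batch_mode=None, namespace=None, workers=-1, acn=None):
    """Return the ingest() keyword arguments that select local or distributed execution.

    Hypothetical helper: only the returned keys (mode, workers, namespace, acn)
    are documented parameters of ingest().
    """
    if not distributed:
        # Mode.LOCAL is the documented default, so no extra arguments are needed.
        return {}
    if batch_mode is None:
        raise ValueError("pass the library's Mode.BATCH value as batch_mode")
    # workers=-1 leaves the number of workers to be auto-configured.
    options = {"mode": batch_mode, "workers": workers}
    if namespace is not None:
        options["namespace"] = namespace
    if acn is not None:
        options["acn"] = acn
    return options
```

The resulting dictionary can be unpacked into a call such as `ingest(index_type=..., index_uri=..., source_uri=..., **options)`.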
Functions
Name | Description |
---|---|
ingest | Ingest vectors into TileDB. |
ingest
vector_search.ingestion.ingest(index_type, index_uri, *, input_vectors=None, source_uri=None, source_type=None, external_ids=None, external_ids_uri='', external_ids_type=None, updates_uri=None, index_timestamp=None, config=None, namespace=None, size=-1, partitions=-1, training_sampling_policy=TrainingSamplingPolicy.FIRST_N, copy_centroids_uri=None, training_sample_size=-1, training_input_vectors=None, training_source_uri=None, training_source_type=None, workers=-1, input_vectors_per_work_item=-1, max_tasks_per_stage=-1, input_vectors_per_work_item_during_sampling=-1, max_sampling_tasks=-1, storage_version=STORAGE_VERSION, verbose=False, trace_id=None, use_sklearn=True, mode=Mode.LOCAL, acn=None, ingest_resources=None, consolidate_partition_resources=None, copy_centroids_resources=None, random_sample_resources=None, kmeans_resources=None, compute_new_centroids_resources=None, assign_points_and_partial_new_centroids_resources=None, write_centroids_resources=None, partial_index_resources=None, **kwargs)
Ingest vectors into TileDB.
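A minimal sketch of a local call follows, using only parameter names that appear in the signature above. The helper names `build_ingest_arguments` and `run_ingest` are hypothetical, the import path `tiledb.vector_search.ingestion` is an assumption based on the module name shown on this page, and the URIs are caller-supplied placeholders.

```python
def build_ingest_arguments(index_uri, source_uri, index_type="IVF_FLAT", partitions=-1):
    """Collect keyword arguments for ingest(), using documented parameter names."""
    supported = ("FLAT", "IVF_FLAT", "VAMANA")
    if index_type not in supported:
        raise ValueError(f"index_type must be one of {supported}, got {index_type!r}")
    return {
        "index_type": index_type,
        "index_uri": index_uri,
        "source_uri": source_uri,
        # -1 leaves the number of partitions to be auto-configured.
        "partitions": partitions,
    }


def run_ingest(index_uri, source_uri, **options):
    """Run a local ingestion and return whatever ingest() returns."""
    # Imported lazily so the helper above can be used without TileDB installed.
    # Assumption: the package exposes this module under this import path.
    from tiledb.vector_search.ingestion import ingest

    return ingest(**build_ingest_arguments(index_uri, source_uri, **options))
```

Because `mode` defaults to `Mode.LOCAL`, a call like this runs on the local machine. Leaving `source_type` unset relies on the documented auto-detection.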
Parameters
Name | Type | Description | Default |
---|---|---|---|
index_type | str | Type of vector index (FLAT, IVF_FLAT, VAMANA). | required |
index_uri | str | Vector index URI (stored as a TileDB group). | required |
input_vectors | np.ndarray | Input vectors. If provided, takes precedence over source_uri and source_type. | None |
source_uri | str | Vectors source URI. | None |
source_type | str | Type of the source vectors. If left empty, it is auto-detected. | None |
external_ids | np.array | External IDs of the input vectors. If provided, takes precedence over external_ids_uri and external_ids_type. | None |
external_ids_uri | str | Source URI for external_ids. | '' |
external_ids_type | str | File type of external_ids_uri. If left empty, it is auto-detected. | None |
updates_uri | str | Updates array URI. Used for consolidation of updates. | None |
index_timestamp | int | Timestamp to use for writing and reading data. Defaults to the current Unix timestamp in milliseconds. | None |
config | Optional[Mapping[str, Any]] | TileDB config dictionary. | None |
namespace | Optional[str] | TileDB Cloud namespace to use for Cloud execution. | None |
size | int | Number of input vectors to ingest. If not provided, the full input dataset is used. If provided, only the first size vectors of the input source are ingested. | -1 |
partitions | int | Number of partitions to load the data with. If not provided, it is auto-configured based on the dataset size. | -1 |
copy_centroids_uri | str | TileDB array URI to copy centroids from. If not provided, centroids are built by running k-means. | None |
training_sample_size | int | Sample size to use for computing k-means. If not provided, it is auto-configured based on the dataset size. Should not be provided if training_source_uri is provided. | -1 |
training_input_vectors | np.ndarray | Training input vectors. If provided, takes precedence over training_source_uri and training_source_type. Should not be provided if training_sample_size or training_source_uri is provided. | None |
training_source_uri | str | The source URI to use for training centroids when building an IVF_FLAT vector index. If not provided, the first training_sample_size vectors from source_uri are used. Should not be provided if training_sample_size or training_input_vectors is provided. | None |
training_source_type | str | Type of the training source data in training_source_uri. If left empty, it is auto-detected. Should only be provided when training_source_uri is provided. | None |
workers | int | Number of distributed workers to use for vector ingestion. If not provided, it is auto-configured based on the dataset size. | -1 |
input_vectors_per_work_item | int | Number of vectors per ingestion work item. If not provided, it is auto-configured. | -1 |
max_tasks_per_stage | int | Maximum number of tasks per execution stage of ingestion. If not provided, it is auto-configured. | -1 |
input_vectors_per_work_item_during_sampling | int | Number of vectors per sample ingestion work item. If not provided, it is auto-configured. Only valid with training_sampling_policy=TrainingSamplingPolicy.RANDOM. | -1 |
max_sampling_tasks | int | Maximum number of tasks per execution stage of sampling. If not provided, it is auto-configured. Only valid with training_sampling_policy=TrainingSamplingPolicy.RANDOM. | -1 |
storage_version | str | Vector index storage format version. If not provided, defaults to the latest version. | STORAGE_VERSION |
verbose | bool | Enables verbose logging. | False |
trace_id | Optional[str] | Trace ID for logging. | None |
use_sklearn | bool | Whether to use scikit-learn’s implementation of k-means clustering instead of tiledb.vector_search’s. | True |
mode | Mode | Execution mode. Defaults to LOCAL; use BATCH for distributed execution. | Mode.LOCAL |
acn | Optional[str] | Access credential name to use for object store access when running in BATCH mode. | None |
ingest_resources | Optional[Mapping[str, Any]] | Resources to request when performing vector ingestion. Only applies to BATCH mode. | None |
consolidate_partition_resources | Optional[Mapping[str, Any]] | Resources to request when consolidating a partition. Only applies to BATCH mode. | None |
copy_centroids_resources | Optional[Mapping[str, Any]] | Resources to request when copying centroids from the input array to the output array. Only applies to BATCH mode. | None |
random_sample_resources | Optional[Mapping[str, Any]] | Resources to request when performing random sample selection. Only applies to BATCH mode. | None |
kmeans_resources | Optional[Mapping[str, Any]] | Resources to request when performing the k-means task. Only applies to BATCH mode. | None |
compute_new_centroids_resources | Optional[Mapping[str, Any]] | Resources to request when performing centroid computation. Only applies to BATCH mode. | None |
assign_points_and_partial_new_centroids_resources | Optional[Mapping[str, Any]] | Resources to request when computing partial centroids. Only applies to BATCH mode. | None |
write_centroids_resources | Optional[Mapping[str, Any]] | Resources to request when writing centroids. Only applies to BATCH mode. | None |
partial_index_resources | Optional[Mapping[str, Any]] | Resources to request when computing partial indexes. Only applies to BATCH mode. | None |
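The training parameters in the table above constrain each other: training_sample_size, training_input_vectors, and training_source_uri should not be combined, and training_source_type only makes sense together with training_source_uri. The sketch below is one way to check those rules before calling `ingest`. The function `check_training_arguments` is hypothetical and not part of the library, and it treats "should not be provided" as a hard error, which is an assumption about how strictly the rules apply.

```python
def check_training_arguments(
    training_sample_size=-1,
    training_input_vectors=None,
    training_source_uri=None,
    training_source_type=None,
):
    """Return the name of the training option in use, or None if all are defaults.

    Raises ValueError when the documented exclusivity rules are violated.
    """
    provided = []
    if training_sample_size != -1:  # -1 is the documented "not provided" default
        provided.append("training_sample_size")
    if training_input_vectors is not None:
        provided.append("training_input_vectors")
    if training_source_uri is not None:
        provided.append("training_source_uri")

    if len(provided) > 1:
        raise ValueError("provide at most one of: " + ", ".join(provided))
    if training_source_type is not None and training_source_uri is None:
        raise ValueError("training_source_type requires training_source_uri")
    return provided[0] if provided else None
```

Called with no arguments, the check returns None, which corresponds to leaving the training sample to be auto-configured.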