ingestion
vector_search.ingestion
Vector Search ingestion Utilities
This contains the ingestion implementation for different TileDB Vector Search algorithms.
It enables:
- Local ingestion:
  - Multi-threaded execution that can leverage all the available local computing resources.
- Distributed ingestion:
  - Distributed execution with multiple workers in TileDB Cloud. This can be used to ingest large datasets and speed up ingestion.
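The two execution styles above correspond to the `mode` argument of `ingest`. The following is a minimal sketch of how the mode-related keyword arguments could be grouped. `Mode` here is a stand-in enum defined only so the snippet runs on its own, and `execution_options` is a hypothetical helper, not part of the library.

```python
from enum import Enum


class Mode(Enum):
    """Stand-in for the library's Mode enum (assumption: only the two
    members named in this reference are modelled here)."""
    LOCAL = "local"
    BATCH = "batch"


def execution_options(distributed, namespace=None, workers=-1, acn=None):
    """Hypothetical helper: collect the ingest() keyword arguments that
    select local or distributed execution."""
    if not distributed:
        # Local ingestion: multi-threaded execution on this machine.
        return {"mode": Mode.LOCAL}
    # Distributed ingestion: multiple workers in TileDB Cloud.
    return {
        "mode": Mode.BATCH,
        "namespace": namespace,  # TileDB Cloud namespace
        "workers": workers,      # -1 leaves the worker count auto-configured
        "acn": acn,              # access credential name for object store access
    }
```

With the real enum substituted for the stand-in, the returned dictionary could be unpacked into the call alongside `index_type` and `index_uri`.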
Functions
Name | Description |
---|---|
ingest | Ingest vectors into TileDB. |
ingest
vector_search.ingestion.ingest(index_type, index_uri, *, input_vectors=None, source_uri=None, source_type=None, external_ids=None, external_ids_uri='', external_ids_type=None, updates_uri=None, index_timestamp=None, config=None, namespace=None, size=-1, partitions=-1, num_subspaces=-1, l_build=-1, r_max_degree=-1, training_sampling_policy=TrainingSamplingPolicy.FIRST_N, copy_centroids_uri=None, training_sample_size=-1, training_input_vectors=None, training_source_uri=None, training_source_type=None, workers=-1, input_vectors_per_work_item=-1, max_tasks_per_stage=-1, input_vectors_per_work_item_during_sampling=-1, max_sampling_tasks=-1, storage_version=STORAGE_VERSION, verbose=False, trace_id=None, use_sklearn=True, mode=Mode.LOCAL, acn=None, ingest_resources=None, consolidate_partition_resources=None, copy_centroids_resources=None, random_sample_resources=None, kmeans_resources=None, compute_new_centroids_resources=None, assign_points_and_partial_new_centroids_resources=None, write_centroids_resources=None, partial_index_resources=None, distance_metric=vspy.DistanceMetric.SUM_OF_SQUARES, normalized=False, **kwargs)
Ingest vectors into TileDB.
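As a rough illustration of the signature above, the sketch below assembles the arguments for a local `IVF_FLAT` ingestion from in-memory vectors. The helper names `local_ivf_flat_kwargs` and `run_local_ingest` are hypothetical, and the import path assumes the module is installed as `tiledb.vector_search.ingestion`; the import is deferred so the first helper can be used without the package.

```python
def local_ivf_flat_kwargs(index_uri, input_vectors, partitions=-1):
    """Hypothetical helper: keyword arguments for a local IVF_FLAT ingestion."""
    return {
        "index_type": "IVF_FLAT",
        "index_uri": index_uri,
        "input_vectors": input_vectors,  # takes precedence over source_uri
        "partitions": partitions,        # -1 leaves the count auto-configured
    }


def run_local_ingest(index_uri, input_vectors, partitions=-1):
    """Call ingest() with in-memory vectors (needs TileDB Vector Search installed)."""
    # Imported lazily so the helper above works without the package.
    from tiledb.vector_search.ingestion import ingest

    return ingest(**local_ivf_flat_kwargs(index_uri, input_vectors, partitions))
```

Here `input_vectors` is expected to be a NumPy array of vectors, as described in the parameter table; all other parameters keep their defaults, so execution stays in `Mode.LOCAL`.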
Parameters
Name | Type | Description | Default |
---|---|---|---|
index_type |
str | Type of vector index (FLAT, IVF_FLAT, IVF_PQ, VAMANA). | required |
index_uri |
str | Vector index URI (stored as TileDB group). | required |
input_vectors |
Optional[np.ndarray] | Input vectors. If provided, they take precedence over source_uri and source_type. | None |
source_uri |
Optional[str] | Vectors source URI. | None |
source_type |
Optional[str] | Type of the source vectors. If left empty it is auto-detected. | None |
external_ids |
Optional[np.array] | External IDs of the input vectors. If provided, they take precedence over external_ids_uri and external_ids_type. | None |
external_ids_uri |
Optional[str] | Source URI for external_ids. | '' |
external_ids_type |
Optional[str] | File type of external_ids_uri. If left empty it is auto-detected. | None |
updates_uri |
Optional[str] | Updates array URI. Used for consolidation of updates. | None |
index_timestamp |
Optional[int] | Timestamp to use for writing and reading data. By default it uses the current unix ms timestamp. | None |
config |
Optional[Mapping[str, Any]] | TileDB config dictionary. | None |
namespace |
Optional[str] | TileDB-Cloud namespace to use for Cloud execution. | None |
size |
int | Number of input vectors. If not provided, the full size of the input dataset is used. If provided, only the first size vectors from the input source are ingested. | -1 |
partitions |
int | For IVF_FLAT and IVF_PQ indexes, the number of partitions to generate from the data during k-means clustering. If not provided, is auto-configured based on the dataset size. | -1 |
num_subspaces |
int | For IVF_PQ encoded indexes, the number of subspaces to use in the PQ encoding. We will divide the dimensions into num_subspaces parts, and PQ encode each part separately. This means dimensions must be divisible by num_subspaces. | -1 |
l_build |
int | For Vamana indexes, the number of neighbors considered for each node during construction of the graph. Larger values will take more time to build but result in indices that provide higher recall for the same search complexity. l_build should be >= r_max_degree unless you need to build indices quickly and can compromise on quality. Typically between 75 and 200. If not provided, use the default value of 100. | -1 |
r_max_degree |
int | For Vamana indexes, the maximum degree for each node in the final graph. Larger values will result in larger indices and longer indexing times, but better search quality. Typically between 60 and 150. If not provided, use the default value of 64. | -1 |
copy_centroids_uri |
Optional[str] | TileDB array URI to copy centroids from. If not provided, centroids are built by running k-means. | None |
training_sample_size |
int | Sample size to use for computing k-means. If not provided, it is auto-configured based on the dataset size. Should not be provided if training_source_uri is provided. | -1 |
training_input_vectors |
Optional[np.ndarray] | Training input vectors. If provided, they take precedence over training_source_uri and training_source_type. Should not be provided if training_sample_size or training_source_uri is provided. | None |
training_source_uri |
Optional[str] | The source URI to use for training centroids when building an IVF_FLAT vector index. If not provided, the first training_sample_size vectors from source_uri are used. Should not be provided if training_sample_size or training_input_vectors is provided. | None |
training_source_type |
Optional[str] | Type of the training source data in training_source_uri. If left empty, it is auto-detected. Should only be provided when training_source_uri is provided. | None |
workers |
int | Number of distributed workers to use for vector ingestion. If not provided, is auto-configured based on the dataset size. | -1 |
input_vectors_per_work_item |
int | Number of vectors per ingestion work item. If not provided, is auto-configured. | -1 |
max_tasks_per_stage |
int | Max number of tasks per execution stage of ingestion. If not provided, is auto-configured. | -1 |
input_vectors_per_work_item_during_sampling |
int | Number of vectors per sample ingestion work item. If not provided, is auto-configured. Only valid with training_sampling_policy=TrainingSamplingPolicy.RANDOM. | -1 |
max_sampling_tasks |
int | Max number of tasks per execution stage of sampling. If not provided, is auto-configured. Only valid with training_sampling_policy=TrainingSamplingPolicy.RANDOM. | -1 |
storage_version |
str | Vector index storage format version. If not provided, defaults to the latest version. | STORAGE_VERSION |
verbose |
bool | Enables verbose logging. | False |
trace_id |
Optional[str] | Trace ID for logging. | None |
use_sklearn |
bool | Whether to use scikit-learn’s implementation of k-means clustering instead of tiledb.vector_search’s. | True |
mode |
Mode | Execution mode. Defaults to LOCAL; use BATCH for distributed execution. | Mode.LOCAL |
acn |
Optional[str] | Access credential name to be used when running in BATCH mode for object store access | None |
ingest_resources |
Optional[Mapping[str, Any]] | Resources to request when performing vector ingestion, only applies to BATCH mode | None |
consolidate_partition_resources |
Optional[Mapping[str, Any]] | Resources to request when performing consolidation of a partition, only applies to BATCH mode | None |
copy_centroids_resources |
Optional[Mapping[str, Any]] | Resources to request when performing copy of centroids from input array to output array, only applies to BATCH mode | None |
random_sample_resources |
Optional[Mapping[str, Any]] | Resources to request when performing random sample selection, only applies to BATCH mode | None |
kmeans_resources |
Optional[Mapping[str, Any]] | Resources to request when performing kmeans task, only applies to BATCH mode | None |
compute_new_centroids_resources |
Optional[Mapping[str, Any]] | Resources to request when performing centroid computation, only applies to BATCH mode | None |
assign_points_and_partial_new_centroids_resources |
Optional[Mapping[str, Any]] | Resources to request when performing the computation of partial centroids, only applies to BATCH mode | None |
write_centroids_resources |
Optional[Mapping[str, Any]] | Resources to request when performing the write of centroids, only applies to BATCH mode | None |
partial_index_resources |
Optional[Mapping[str, Any]] | Resources to request when performing the computation of partial indexing, only applies to BATCH mode | None |
distance_metric |
vspy.DistanceMetric | Distance metric to use for the index, defaults to ‘vspy.DistanceMetric.SUM_OF_SQUARES’. Options are ‘vspy.DistanceMetric.SUM_OF_SQUARES’, ‘vspy.DistanceMetric.INNER_PRODUCT’, ‘vspy.DistanceMetric.COSINE’, ‘vspy.DistanceMetric.L2’. | vspy.DistanceMetric.SUM_OF_SQUARES |
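Several rows in the table above state constraints between parameters (divisibility for `num_subspaces`, the `l_build >= r_max_degree` recommendation, and the mutually exclusive training inputs). The sketch below shows one way these could be checked before calling `ingest`. `check_ingest_args` and its `dimensions` argument are hypothetical, and this is not how the library itself validates input; the defaults of 100 and 64 are the ones quoted in the table.

```python
def check_ingest_args(dimensions, num_subspaces=-1, l_build=-1, r_max_degree=-1,
                      training_sample_size=-1, training_input_vectors=None,
                      training_source_uri=None, training_source_type=None):
    """Hypothetical pre-flight check; returns a list of problem descriptions."""
    problems = []

    # IVF_PQ: the vector dimensions must be divisible by num_subspaces.
    if num_subspaces > 0 and dimensions % num_subspaces != 0:
        problems.append(
            f"dimensions ({dimensions}) is not divisible by "
            f"num_subspaces ({num_subspaces})"
        )

    # Vamana: use the documented defaults when a value is left at -1.
    effective_l = 100 if l_build == -1 else l_build
    effective_r = 64 if r_max_degree == -1 else r_max_degree
    if effective_l < effective_r:
        problems.append(
            f"l_build ({effective_l}) < r_max_degree ({effective_r}): "
            "faster to build, but index quality may suffer"
        )

    # Training inputs: the table says these should not be combined.
    provided = []
    if training_sample_size != -1:
        provided.append("training_sample_size")
    if training_input_vectors is not None:
        provided.append("training_input_vectors")
    if training_source_uri is not None:
        provided.append("training_source_uri")
    if len(provided) > 1:
        problems.append("provide only one of: " + ", ".join(provided))

    # training_source_type only makes sense together with training_source_uri.
    if training_source_type is not None and training_source_uri is None:
        problems.append("training_source_type requires training_source_uri")

    return problems
```

Returning a list rather than raising keeps the `l_build` check advisory, since the table allows a smaller `l_build` when build speed matters more than quality.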