Storage Format Spec

Learn about the vector search storage format specification for different indexing algorithms.

The storage model used for indexing vectors in TileDB-Vector-Search depends heavily on the indexing algorithm. However, some high-level structures are shared across all algorithms.

Cross algorithm storage format

All data and metadata required for a TileDB-Vector-Search index are stored inside a TileDB group (index_uri). All of the named arrays listed below are stored under this URI.

Index metadata

Metadata values that configure the properties of an index are stored in the index_uri group metadata. Some metadata values are required by every algorithm implementation, while others are specific to a single algorithm. The table below lists the metadata values that are recorded for all algorithms.

Name Description
dataset_type The asset type for disambiguation in TileDB Cloud. Value: vector_search
index_type The index algorithm used for this index. Can be one of the following values: FLAT, IVF_FLAT, VAMANA, IVF_PQ
storage_version The storage version used for the index. The storage version lets indexing algorithms update their storage logic without affecting previously created indexes, which maintains backwards compatibility.
dtype The data type of the vector values.
ingestion_timestamps An ordered list of timestamps that correspond to different calls of ingestion and update consolidation through the lifetime of the index.
base_sizes An ordered list of the number of vectors in the base index at each of the ingestion timestamps.
has_updates Boolean value denoting if there are updates recorded in the updates array.
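
As one illustration of how the two ordered lists line up, the sketch below looks up the base index size in effect at a given point in time. It assumes the metadata values have already been read into plain Python lists (how they are encoded inside the group metadata is not covered here), that ingestion_timestamps and base_sizes are parallel, and that an ingestion applies from its timestamp onward. The helper base_size_at is hypothetical and not part of the library.

```python
from bisect import bisect_right


def base_size_at(ingestion_timestamps, base_sizes, timestamp):
    """Return the base index size in effect at `timestamp`.

    Assumes both lists are parallel and sorted by time (an assumption
    drawn from the "ordered list" wording of the spec).
    """
    if len(ingestion_timestamps) != len(base_sizes):
        raise ValueError("ingestion_timestamps and base_sizes must be parallel")
    # Number of ingestions that happened at or before `timestamp`.
    count = bisect_right(ingestion_timestamps, timestamp)
    if count == 0:
        return 0  # nothing had been ingested yet
    return base_sizes[count - 1]


# Hypothetical metadata values, for illustration only.
timestamps = [1000, 2000, 3000]
sizes = [500, 750, 900]
print(base_size_at(timestamps, sizes, 2500))
```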

Object metadata

This is a 1D sparse array with external_id as its dimension. Its attributes are the user-defined metadata attributes of the respective vectors.

Basic schema parameters

Parameter Value
Array type Sparse
Rank 1D
Cell order Row-major
Tile order Row-major

Dimensions

Dimension Name TileDB Datatype
external_id uint64_t

Updates

TileDB-Vector-Search supports updates for all index algorithms by recording updates outside the main indexing storage structure and periodically consolidating them. The updates are stored in the updates array, a sparse 1D array whose dimension is the external_id of each vector. It has one variable-length attribute that encodes the vector itself, or an empty value if the vector was deleted.

Basic schema parameters

Parameter Value
Array type Sparse
Rank 1D
Cell order Row-major
Tile order Row-major

Dimensions

Dimension Name TileDB Datatype
external_id uint64_t

Attributes

Attribute Name TileDB Datatype Description
vector variable dtype Contains the vector value. Empty values correspond to vector deletions.
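
The empty-value convention can be sketched as a simple overlay of recorded updates on top of the base vectors. This is a minimal illustration using in-memory dictionaries, not the library's consolidation logic; apply_updates is a hypothetical helper and the sample data is made up.

```python
def apply_updates(base, updates):
    """Overlay recorded updates on top of the base vectors.

    `base` maps external_id -> vector; `updates` maps external_id -> vector,
    where an empty vector marks a deletion (as described for the `vector`
    attribute). Returns a new dict; the inputs are not modified.
    """
    merged = dict(base)
    for external_id, vector in updates.items():
        if len(vector) == 0:
            merged.pop(external_id, None)  # empty value: vector was deleted
        else:
            merged[external_id] = list(vector)  # insert or overwrite
    return merged


# Hypothetical data: id 2 is deleted, id 3 is newly added.
base = {1: [0.0, 1.0], 2: [1.0, 0.0]}
updates = {2: [], 3: [0.5, 0.5]}
print(apply_updates(base, updates))
```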

Algorithm specific storage format

FLAT

shuffled_vectors

This is a 2D dense array that holds all the vectors with no specific ordering.

Basic schema parameters

Parameter Value
Array type Dense
Rank 2D
Cell order Col-major
Tile order Col-major

Dimensions

Dimension Name TileDB Datatype Domain Description
rows int32_t [0, dimensions] Corresponds to the vector dimensions.
cols int32_t [0, MAX_INT32] Corresponds to the vector position in the set of vectors.

Attributes

Attribute Name TileDB Datatype Description
values dtype Contains the vector value at the specific dimension.

shuffled_ids

This is a 1D dense array that maps vector positions in the shuffled_vectors array to external_ids of each vector.

Basic schema parameters

Parameter Value
Array type Dense
Rank 1D
Cell order Col-major
Tile order Col-major

Dimensions

Dimension Name TileDB Datatype Domain Description
rows int32_t [0, MAX_INT32] Corresponds to the vector position in shuffled_vectors.

Attributes

Attribute Name TileDB Datatype Description
values uint64_t Contains the vector’s external_id.
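
Because shuffled_vectors is column-major with one vector per column, each vector is contiguous in a flat buffer, and the same position indexes shuffled_ids. The sketch below shows that relationship with an exhaustive scan. It assumes the values have already been read into a flat Python list and uses squared Euclidean distance as the metric; both the helper names and the metric are assumptions made for illustration.

```python
def vector_at(values, dimensions, position):
    """Return the vector stored at column `position`.

    `values` is a flat buffer in column-major order for a
    (dimensions x num_vectors) array, so each vector is contiguous.
    """
    start = position * dimensions
    return values[start:start + dimensions]


def nearest_external_id(values, shuffled_ids, dimensions, query):
    """Scan every vector and return the external_id of the closest one."""
    best_id, best_dist = None, float("inf")
    for position, external_id in enumerate(shuffled_ids):
        vec = vector_at(values, dimensions, position)
        # Squared Euclidean distance (assumed metric for this sketch).
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        if dist < best_dist:
            best_id, best_dist = external_id, dist
    return best_id


# Three hypothetical 2-dimensional vectors: (0, 0), (1, 1), (5, 5).
values = [0.0, 0.0, 1.0, 1.0, 5.0, 5.0]
shuffled_ids = [101, 102, 103]
print(nearest_external_id(values, shuffled_ids, 2, [0.9, 1.2]))
```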

IVF_FLAT

Metadata

Name Description
partition_history An ordered list of the number of partitions used at different ingestion timestamps.

partition_centroids

This is a 2D dense array storing the k-means centroids for the different vector partitions.

Basic schema parameters

Parameter Value
Array type Dense
Rank 2D
Cell order Col-major
Tile order Col-major

Dimensions

Dimension Name TileDB Datatype Domain Description
rows int32_t [0, dimensions] Corresponds to the centroid dimensions.
cols int32_t [0, MAX_INT32] Corresponds to the centroid id.

Attributes

Attribute Name TileDB Datatype Description
centroids dtype Contains the centroid value at the specific dimension.

partition_indexes

This is a 1D dense array recording the start-end index of each partition of vectors in the shuffled_vectors array.

Basic schema parameters

Parameter Value
Array type Dense
Rank 1D
Cell order Col-major
Tile order Col-major

Dimensions

Dimension Name TileDB Datatype Domain Description
rows int32_t [0, MAX_INT32] Corresponds to the partition id.

Attributes

Attribute Name TileDB Datatype Description
values uint64_t Contains the position of the partition split in the shuffled_vectors array.
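
A minimal sketch of how split positions translate into column ranges follows. It assumes the array holds one more entry than there are partitions, so that partition p occupies the half-open range from entry p up to (but not including) entry p + 1. That reading is an assumption based on the "start-end index" wording, and partition_range is a hypothetical helper.

```python
def partition_range(partition_indexes, partition_id):
    """Return the (start, end) column range of a partition.

    Assumed layout: `end` is exclusive, and the array has
    num_partitions + 1 entries.
    """
    if not 0 <= partition_id < len(partition_indexes) - 1:
        raise IndexError("partition_id out of range")
    return partition_indexes[partition_id], partition_indexes[partition_id + 1]


# Hypothetical split positions for three partitions over seven vectors.
partition_indexes = [0, 3, 3, 7]
for p in range(len(partition_indexes) - 1):
    start, end = partition_range(partition_indexes, p)
    print(f"partition {p}: columns [{start}, {end}), {end - start} vectors")
```

Under this layout an empty partition shows up as two equal consecutive entries.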

shuffled_vectors

This is a 2D dense array that holds all the vectors. Each vector partition is stored in a consecutive index range of this array.

Basic schema parameters

Parameter Value
Array type Dense
Rank 2D
Cell order Col-major
Tile order Col-major

Dimensions

Dimension Name TileDB Datatype Domain Description
rows int32_t [0, dimensions] Corresponds to the vector dimensions.
cols int32_t [0, MAX_INT32] Corresponds to the vector position in the set of vectors.

Attributes

Attribute Name TileDB Datatype Description
values dtype Contains the vector value at the specific dimension.

shuffled_ids

This is a 1D dense array that maps vector indices in the shuffled_vectors array to external_ids of each vector.

Basic schema parameters

Parameter Value
Array type Dense
Rank 1D
Cell order Col-major
Tile order Col-major

Dimensions

Dimension Name TileDB Datatype Domain Description
rows int32_t [0, MAX_INT32] Corresponds to the vector position in shuffled_vectors.

Attributes

Attribute Name TileDB Datatype Description
values uint64_t Contains the vector external_id.
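
To show how the four IVF_FLAT arrays fit together, here is a rough sketch of a query that probes the partitions whose centroids are closest to the query vector. It is not the library's implementation. It assumes in-memory lists where centroids[p] and vectors[i] are single columns of the respective arrays, squared Euclidean distance as the metric, partition p spanning entries p to p + 1 of partition_indexes, and a parameter named nprobe for the number of partitions to scan. All of these are assumptions for illustration.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ivf_flat_query(query, centroids, partition_indexes, vectors, ids, nprobe=1):
    """Return (external_id, squared_distance) of the closest vector found.

    `centroids[p]` is the centroid of partition p, `vectors[i]` is column i
    of shuffled_vectors, and `ids[i]` is the matching shuffled_ids entry.
    """
    # Rank partitions by the distance from the query to their centroid.
    ranked = sorted(range(len(centroids)),
                    key=lambda p: squared_distance(query, centroids[p]))
    best = (None, float("inf"))
    for p in ranked[:nprobe]:
        # Assumed layout: partition p spans [indexes[p], indexes[p + 1]).
        start, end = partition_indexes[p], partition_indexes[p + 1]
        for i in range(start, end):
            dist = squared_distance(query, vectors[i])
            if dist < best[1]:
                best = (ids[i], dist)
    return best


# Hypothetical index with two partitions of two vectors each.
centroids = [[0, 0], [10, 10]]
partition_indexes = [0, 2, 4]
vectors = [[0, 1], [1, 0], [9, 10], [10, 11]]
ids = [11, 12, 13, 14]
print(ivf_flat_query([9, 9], centroids, partition_indexes, vectors, ids))
```

Only the probed partitions are scanned, which is why each partition is kept in a consecutive index range.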

VAMANA

Metadata

Name Description
l_build The l_build parameter used when constructing the graph.
r_max_degree The r_max_degree parameter used when constructing the graph.

shuffled_vectors

This is a 2D dense array that holds all the vectors. Each vector partition is stored in a consecutive index range of this array.

Basic schema parameters

Parameter Value
Array type Dense
Rank 2D
Cell order Col-major
Tile order Col-major

Dimensions

Dimension Name TileDB Datatype Domain Description
rows int32_t [0, dimensions] Corresponds to the vector dimensions.
cols int32_t [0, MAX_INT32] Corresponds to the vector position in the set of vectors.

Attributes

Attribute Name TileDB Datatype Description
values dtype Contains the vector value at the specific dimension.

shuffled_ids

This is a 1D dense array that maps vector indices in the shuffled_vectors array to external_ids of each vector.

Basic schema parameters

Parameter Value
Array type Dense
Rank 1D
Cell order Col-major
Tile order Col-major

Dimensions

Dimension Name TileDB Datatype Domain Description
rows int32_t [0, MAX_INT32] Corresponds to the vector position in shuffled_vectors.

Attributes

Attribute Name TileDB Datatype Description
values uint64_t Contains the vector external_id.

adjacency_row_index_array_name

This is a 1D dense array that holds the row offsets of the graph, which is stored in compressed sparse row (CSR) format. Each value indicates where the neighbors (edges) of each successive node start in adjacency_ids and adjacency_scores. For example, the values [0, 2, 8, 13] indicate that the neighbors of node 0 start at index 0, the neighbors of node 1 start at index 2, and the neighbors of node 2 start at index 8. The final value marks the end of the array, so the neighbors of node 2 occupy indexes 8 through 12. With that information, you can look in adjacency_ids to determine the destination node of each edge. The source node is inferred from the position within adjacency_row_index_array_name. Once you know the source or destination node index, you can look at that index in shuffled_vectors or shuffled_ids to get the vector or external ID for that node.
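
The offset arithmetic in the [0, 2, 8, 13] example can be sketched as follows. The helper neighbor_range is hypothetical, and the end offset is treated as exclusive, in line with the final value being the end of the array.

```python
def neighbor_range(row_index, node):
    """Return the (start, end) range of a node's edges; `end` is exclusive."""
    if not 0 <= node < len(row_index) - 1:
        raise IndexError("node out of range")
    return row_index[node], row_index[node + 1]


row_index = [0, 2, 8, 13]
for node in range(len(row_index) - 1):
    start, end = neighbor_range(row_index, node)
    print(f"node {node}: edges [{start}, {end}), {end - start} neighbors")
```

An array of N + 1 offsets therefore describes N nodes, and the degree of a node is the difference between two consecutive offsets.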

Basic schema parameters

Parameter Value
Array type Dense
Rank 1D
Cell order Col-major
Tile order Col-major

Dimensions

Dimension Name TileDB Datatype Domain Description
rows int32_t [0, MAX_INT32] Corresponds to the vector position in shuffled_vectors and shuffled_ids.

Attributes

Attribute Name TileDB Datatype Description
values uint64_t Contains the start and stop indexes in adjacency_ids and adjacency_scores for the node.

adjacency_ids

This is a 1D dense array that holds the index of the destination vector for each edge in the compressed sparse row (CSR) format graph. Each value is an index into the shuffled_vectors and shuffled_ids arrays. This array holds only the destination nodes of the graph. The source node of an edge is implied by adjacency_row_index_array_name, whose values point into adjacency_ids.

Basic schema parameters

Parameter Value
Array type Dense
Rank 1D
Cell order Col-major
Tile order Col-major

Dimensions

Dimension Name TileDB Datatype Domain Description
rows int32_t [0, MAX_INT32] Corresponds to the vector position in shuffled_vectors and shuffled_ids.

Attributes

Attribute Name TileDB Datatype Description
values uint64_t Contains the index of the destination vector for this edge in the graph.

adjacency_scores

This is a 1D dense array that holds the distance of each edge in the compressed sparse row (CSR) format graph. It follows the same layout as adjacency_ids, so the value at a given position is the distance of the edge stored at the same position in adjacency_ids.

Basic schema parameters

Parameter Value
Array type Dense
Rank 1D
Cell order Col-major
Tile order Col-major

Dimensions

Dimension Name TileDB Datatype Domain Description
rows int32_t [0, MAX_INT32] Corresponds to the vector position in adjacency_ids.

Attributes

Attribute Name TileDB Datatype Description
values float Contains the distance between neighbors in the graph.
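
Putting the three adjacency arrays together with shuffled_ids, the graph can be expanded back into an explicit edge list. The sketch below is a minimal illustration over in-memory lists with made-up values. The generator iter_edges is hypothetical, and the row offsets are assumed to contain one more entry than there are nodes.

```python
def iter_edges(row_index, adjacency_ids, adjacency_scores, shuffled_ids):
    """Yield (source_external_id, destination_external_id, distance)."""
    for source in range(len(row_index) - 1):
        # Edges of `source` occupy [row_index[source], row_index[source + 1]).
        for edge in range(row_index[source], row_index[source + 1]):
            destination = adjacency_ids[edge]
            yield (shuffled_ids[source],
                   shuffled_ids[destination],
                   adjacency_scores[edge])


# Hypothetical graph with three nodes and four directed edges.
row_index = [0, 2, 3, 4]
adjacency_ids = [1, 2, 0, 0]
adjacency_scores = [0.5, 1.5, 0.5, 1.5]
shuffled_ids = [100, 200, 300]
for edge in iter_edges(row_index, adjacency_ids, adjacency_scores, shuffled_ids):
    print(edge)
```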