files.indexing

cloud.files.indexing

Functions

Name Description
create_dataset_udf Create a TileDB vector search dataset.
index_files_udf Ingest files into a vector search text index.
ingest_files Ingest files into a vector search text index.

create_dataset_udf

cloud.files.indexing.create_dataset_udf(
    search_uri,
    index_uri,
    *,
    config=None,
    environment_variables=None,
    verbose=False,
    index_type=IndexTypes.IVF_FLAT,
    index_creation_kwargs=None,
    pattern='*',
    ignore=('[.]*', '*/[.]*'),
    suffixes=None,
    max_files=None,
    text_splitter='RecursiveCharacterTextSplitter',
    text_splitter_kwargs=None,
    embedding_class='LangChainEmbedding',
    embedding_kwargs=None
)

Create a TileDB vector search dataset.

index_files_udf

cloud.files.indexing.index_files_udf(
    index_uri,
    *,
    acn=None,
    config=None,
    environment_variables=None,
    openai_key=None,
    namespace=None,
    verbose=False,
    trace_id=None,
    index_timestamp=None,
    workers=-1,
    worker_resources=None,
    worker_image=None,
    extra_worker_modules=None,
    driver_resources=None,
    driver_image=None,
    extra_driver_modules=None,
    max_tasks_per_stage=-1,
    embeddings_generation_mode=dag.Mode.BATCH,
    embeddings_generation_driver_mode=dag.Mode.BATCH,
    vector_indexing_mode=dag.Mode.BATCH,
    index_update_kwargs=None
)

Ingest files into a vector search text index.

ingest_files

cloud.files.indexing.ingest_files(
    search_uri,
    index_uri,
    *,
    acn=None,
    config=None,
    environment_variables=None,
    namespace=None,
    verbose=False,
    trace_id=None,
    index_type=IndexTypes.IVF_FLAT,
    index_creation_kwargs=None,
    index_dag_resources=dag.MIN_BATCH_RESOURCES,
    include='*',
    exclude=('[.]*', '*/[.]*'),
    suffixes=None,
    max_files=None,
    text_splitter='RecursiveCharacterTextSplitter',
    text_splitter_kwargs=None,
    embedding_class='LangChainEmbedding',
    embedding_kwargs=None,
    openai_key=None,
    index_timestamp=None,
    workers=-1,
    worker_resources=None,
    worker_image=None,
    extra_worker_modules=None,
    driver_resources=None,
    driver_image=None,
    extra_driver_modules=None,
    max_tasks_per_stage=-1,
    embeddings_generation_mode=dag.Mode.BATCH,
    embeddings_generation_driver_mode=dag.Mode.BATCH,
    vector_indexing_mode=dag.Mode.BATCH,
    index_update_kwargs=None,
    threads='16',
    ingest_resources=None,
    consolidate_partition_resources=None,
    copy_centroids_resources=None,
    random_sample_resources=None,
    kmeans_resources=None,
    compute_new_centroids_resources=None,
    assign_points_and_partial_new_centroids_resources=None,
    write_centroids_resources=None,
    partial_index_resources=None
)

Ingest files into a vector search text index.

Parameters

Name Type Description Default
search_uri str URI to load files from. This can be a directory URI or a FileStore file URI. required
index_uri str URI of the vector index to load files to. required
acn Optional[str] Access Credentials Name (ACN) registered in TileDB Cloud (ARN type), defaults to None. None
config Optional[dict] Configuration dictionary, defaults to None. None
environment_variables Optional[Mapping[str, str]] Environment variables to use during ingestion. None
namespace Optional[str] TileDB-Cloud namespace, defaults to None. None
verbose bool Verbose logging, defaults to False. False
trace_id Optional[str] Trace ID for logging, defaults to None. None
index_type IndexTypes Vector search index type (“FLAT”, “IVF_FLAT”). IndexTypes.IVF_FLAT
index_creation_kwargs Optional[Dict] Arguments to be passed to the index creation method. None
index_dag_resources Optional[Mapping[str, Any]] Node specs configuration for index creation. dag.MIN_BATCH_RESOURCES
include str File pattern to include relative to search_uri. By default set to include all files. '*'
exclude Optional[Sequence[str]] File patterns to exclude relative to search_uri. By default set to ignore all hidden files. ('[.]*', '*/[.]*')
suffixes Optional[Sequence[str]] Keep only files with these suffixes. Useful when you want to keep files with several different suffixes. Suffixes must include the dot, e.g. “.txt”. None
max_files Optional[int] Maximum number of files to include. None
text_splitter_kwargs Optional[Dict] Arguments for the splitter class. None
index_timestamp Optional[int] Timestamp to add index updates at. None
workers int If embeddings_generation_mode=BATCH, the number of distributed workers to use. -1
worker_resources Optional[Dict] If embeddings_generation_mode=BATCH, specifies the worker resources. None
worker_image Optional[str] If embeddings_generation_mode=BATCH, specifies the worker Docker image. None
extra_worker_modules Optional[List[str]] If embeddings_generation_mode=BATCH, extra pip packages to install in the worker image. None
driver_resources Optional[Dict] If embeddings_generation_driver_mode=BATCH, specifies the driver resources. None
driver_image Optional[str] If embeddings_generation_driver_mode=BATCH, specifies the driver Docker image. None
extra_driver_modules Optional[List[str]] If embeddings_generation_driver_mode=BATCH, extra pip packages to install in the driver image. None
max_tasks_per_stage int Maximum number of UDF tasks per computation stage. -1
embeddings_generation_mode dag.Mode TaskGraph execution mode for embeddings generation. dag.Mode.BATCH
embeddings_generation_driver_mode dag.Mode TaskGraph execution mode for the ingestion driver. dag.Mode.BATCH
vector_indexing_mode dag.Mode TaskGraph execution mode for the vector indexing. dag.Mode.BATCH
index_update_kwargs Optional[Dict] Extra arguments to pass to the index update job. These can be any of the documented arguments of the tiledb.vector_search.ingest method (https://tiledb-inc.github.io/TileDB-Vector-Search/documentation/reference/ingestion.html#tiledb.vector_search.ingestion.ingest), with the exception of the BATCH embedding resources covered by the parameters that follow. files_per_partition (int, defaults to -1) can also be included. The resource parameters that follow apply only if the index update is executed in BATCH mode. None
threads str Number of threads to use in the nodes, defaults to 16. '16'
ingest_resources Optional[Dict] Resources to request when performing vector ingestion. None
consolidate_partition_resources Optional[Dict] Resources to request when performing consolidation of a partition. None
copy_centroids_resources Optional[Dict] Resources to request when performing copy of centroids from input array to output array. None
random_sample_resources Optional[Dict] Resources to request when performing random sample selection. None
kmeans_resources Optional[Dict] Resources to request when performing kmeans task. None
compute_new_centroids_resources Optional[Dict] Resources to request when performing centroid computation. None
assign_points_and_partial_new_centroids_resources Optional[Dict] Resources to request when performing the computation of partial centroids. None
write_centroids_resources Optional[Dict] Resources to request when performing the write of centroids. None
partial_index_resources Optional[Dict] Resources to request when performing the computation of partial indexing. None
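
The interaction of include, exclude, suffixes and max_files can be illustrated with a small local sketch. It assumes shell-style glob matching against paths relative to search_uri, which is an assumption about the matching semantics rather than something this page states; select_files is a hypothetical helper written for illustration and is not part of the library.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath
from typing import List, Optional, Sequence


def select_files(
    paths: Sequence[str],
    include: str = "*",
    exclude: Optional[Sequence[str]] = ("[.]*", "*/[.]*"),
    suffixes: Optional[Sequence[str]] = None,
    max_files: Optional[int] = None,
) -> List[str]:
    """Hypothetical stand-in for the file filtering step (not the library's code)."""
    selected: List[str] = []
    for path in paths:
        # Stop once the requested number of files has been collected.
        if max_files is not None and len(selected) >= max_files:
            break
        # Keep only paths that match the include pattern.
        if not fnmatchcase(path, include):
            continue
        # Drop paths that match any exclude pattern (hidden files by default).
        if exclude and any(fnmatchcase(path, pattern) for pattern in exclude):
            continue
        # Suffixes include the leading dot, e.g. ".txt".
        if suffixes and PurePosixPath(path).suffix not in suffixes:
            continue
        selected.append(path)
    return selected


# Made-up relative paths, used only to exercise the defaults.
example_paths = ["readme.txt", ".gitignore", "docs/guide.md", "docs/.cache", "notes/a.txt"]
visible = select_files(example_paths)
text_only = select_files(example_paths, suffixes=[".txt"])
```

With the default patterns, the two hidden entries are dropped because '[.]*' matches a leading dot at the top level and '*/[.]*' matches a leading dot after any directory separator.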

Returns

Name Type Description
The resulting TaskGraph’s server UUID.
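
A call to ingest_files might be assembled as in the sketch below. Several things here are assumptions: that the module is importable as tiledb.cloud.files.indexing, that resource dictionaries use "cpu" and "memory" keys, and every value shown (namespace, ACN, suffixes, counts). build_ingest_kwargs and run_ingestion are hypothetical helpers introduced only for this example.

```python
from typing import Any, Dict


def build_ingest_kwargs(namespace: str, acn: str) -> Dict[str, Any]:
    """Collect keyword arguments for ingest_files (all values are illustrative)."""
    return {
        "namespace": namespace,
        "acn": acn,
        "suffixes": [".txt", ".md"],  # suffixes must include the dot
        "max_files": 100,
        "workers": 2,  # used when embeddings_generation_mode is BATCH
        # Assumed resource format; check your TileDB Cloud setup.
        "worker_resources": {"cpu": "2", "memory": "4Gi"},
        # files_per_partition is documented as an accepted extra key.
        "index_update_kwargs": {"files_per_partition": 10},
        "verbose": True,
    }


def run_ingestion(search_uri: str, index_uri: str, **kwargs: Any) -> Any:
    """Submit the ingestion and return whatever ingest_files returns."""
    # Imported lazily so this sketch can be loaded without the package installed.
    # The import path is an assumption based on the module name on this page.
    from tiledb.cloud.files.indexing import ingest_files

    return ingest_files(search_uri, index_uri, **kwargs)


ingest_kwargs = build_ingest_kwargs("my-namespace", "my-acn")
```

Calling run_ingestion with a search URI, an index URI and **ingest_kwargs would then submit the job; per the Returns section, the result identifies the server-side TaskGraph.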