files.indexing
cloud.files.indexing
Functions
| Name | Description |
|---|---|
| create_dataset_udf | Create a TileDB vector search dataset. |
| index_files_udf | Ingest files into a vector search text index. |
| ingest_files | Ingest files into a vector search text index. |
create_dataset_udf
cloud.files.indexing.create_dataset_udf(
search_uri,
index_uri,
*,
config=None,
environment_variables=None,
verbose=False,
index_type=IndexTypes.IVF_FLAT,
index_creation_kwargs=None,
pattern='*',
ignore=('[.]*', '*/[.]*'),
suffixes=None,
max_files=None,
text_splitter='RecursiveCharacterTextSplitter',
text_splitter_kwargs=None,
embedding_class='LangChainEmbedding',
embedding_kwargs=None,
)
Create a TileDB vector search dataset.
index_files_udf
cloud.files.indexing.index_files_udf(
index_uri,
*,
acn=None,
config=None,
environment_variables=None,
openai_key=None,
namespace=None,
verbose=False,
trace_id=None,
index_timestamp=None,
workers=-1,
worker_resources=None,
worker_image=None,
extra_worker_modules=None,
driver_resources=None,
driver_image=None,
extra_driver_modules=None,
max_tasks_per_stage=-1,
embeddings_generation_mode=dag.Mode.BATCH,
embeddings_generation_driver_mode=dag.Mode.BATCH,
vector_indexing_mode=dag.Mode.BATCH,
index_update_kwargs=None,
)
Ingest files into a vector search text index.
ingest_files
cloud.files.indexing.ingest_files(
search_uri,
index_uri,
*,
acn=None,
config=None,
environment_variables=None,
namespace=None,
verbose=False,
trace_id=None,
index_type=IndexTypes.IVF_FLAT,
index_creation_kwargs=None,
index_dag_resources=dag.MIN_BATCH_RESOURCES,
include='*',
exclude=('[.]*', '*/[.]*'),
suffixes=None,
max_files=None,
text_splitter='RecursiveCharacterTextSplitter',
text_splitter_kwargs=None,
embedding_class='LangChainEmbedding',
embedding_kwargs=None,
openai_key=None,
index_timestamp=None,
workers=-1,
worker_resources=None,
worker_image=None,
extra_worker_modules=None,
driver_resources=None,
driver_image=None,
extra_driver_modules=None,
max_tasks_per_stage=-1,
embeddings_generation_mode=dag.Mode.BATCH,
embeddings_generation_driver_mode=dag.Mode.BATCH,
vector_indexing_mode=dag.Mode.BATCH,
index_update_kwargs=None,
threads='16',
ingest_resources=None,
consolidate_partition_resources=None,
copy_centroids_resources=None,
random_sample_resources=None,
kmeans_resources=None,
compute_new_centroids_resources=None,
assign_points_and_partial_new_centroids_resources=None,
write_centroids_resources=None,
partial_index_resources=None,
)
Ingest files into a vector search text index.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| search_uri | str | URI to load files from. This can be a directory URI or a FileStore file URI. | required |
| index_uri | str | URI of the vector index to load files to. | required |
| acn | Optional[str] | Access Credentials Name (ACN) registered in TileDB Cloud (ARN type), defaults to None. | None |
| config | Optional[dict] | config dictionary, defaults to None. | None |
| environment_variables | Optional[Mapping[str, str]] | Environment variables to use during ingestion. | None |
| namespace | Optional[str] | TileDB-Cloud namespace, defaults to None. | None |
| verbose | bool | verbose logging, defaults to False. | False |
| trace_id | Optional[str] | Trace ID for logging, defaults to None. | None |
| index_type | IndexTypes | Vector search index type (“FLAT”, “IVF_FLAT”). | IndexTypes.IVF_FLAT |
| index_creation_kwargs | Optional[Dict] | Arguments to be passed to the index creation method. | None |
| index_dag_resources | Optional[Mapping[str, Any]] | Index creation Node Specs configuration. | dag.MIN_BATCH_RESOURCES |
| include | str | File pattern to include, relative to search_uri. By default, includes all files. | '*' |
| exclude | Optional[Sequence[str]] | File patterns to exclude, relative to search_uri. By default, ignores all hidden files. | ('[.]*', '*/[.]*') |
| suffixes | Optional[Sequence[str]] | Provide to keep only files with these suffixes. Useful when wanting to keep files with different suffixes. Suffixes must include the dot, e.g. “.txt”. | None |
| max_files | Optional[int] | Maximum number of files to include. | None |
| text_splitter_kwargs | Optional[Dict] | Arguments for the splitter class. | None |
| index_timestamp | Optional[int] | Timestamp to add index updates at. | None |
| workers | int | If embeddings_generation_mode=BATCH, the number of distributed workers to use. | -1 |
| worker_resources | Optional[Dict] | If embeddings_generation_mode=BATCH, this can be used to specify the worker resources. | None |
| worker_image | Optional[str] | If embeddings_generation_mode=BATCH, this can be used to specify the worker Docker image. | None |
| extra_worker_modules | Optional[List[str]] | If embeddings_generation_mode=BATCH, this can be used to install extra pip packages in the worker image. | None |
| driver_resources | Optional[Dict] | If embeddings_generation_driver_mode=BATCH, this can be used to specify the driver resources. | None |
| driver_image | Optional[str] | If embeddings_generation_driver_mode=BATCH, this can be used to specify the driver Docker image. | None |
| extra_driver_modules | Optional[List[str]] | If embeddings_generation_driver_mode=BATCH, this can be used to install extra pip packages in the driver image. | None |
| max_tasks_per_stage | int | Maximum number of UDF tasks per computation stage. | -1 |
| embeddings_generation_mode | dag.Mode | TaskGraph execution mode for embeddings generation. | dag.Mode.BATCH |
| embeddings_generation_driver_mode | dag.Mode | TaskGraph execution mode for the ingestion driver. | dag.Mode.BATCH |
| vector_indexing_mode | dag.Mode | TaskGraph execution mode for the vector indexing. | dag.Mode.BATCH |
| index_update_kwargs | Optional[Dict] | Extra arguments to pass to the index update job. These can be any of the documented arguments of the tiledb.vector_search.ingest method (https://tiledb-inc.github.io/TileDB-Vector-Search/documentation/reference/ingestion.html#tiledb.vector_search.ingestion.ingest), with the exception of the BATCH embedding resources, which are the separate parameters that follow and apply only if the index update is executed in BATCH mode. files_per_partition: int can also be included (defaults to -1). | None |
| threads | str | Threads to be used in the Nodes, defaults to 16. | '16' |
| ingest_resources | Optional[Dict] | Resources to request when performing vector ingestion. | None |
| consolidate_partition_resources | Optional[Dict] | Resources to request when performing consolidation of a partition. | None |
| copy_centroids_resources | Optional[Dict] | Resources to request when performing copy of centroids from input array to output array. | None |
| random_sample_resources | Optional[Dict] | Resources to request when performing random sample selection. | None |
| kmeans_resources | Optional[Dict] | Resources to request when performing kmeans task. | None |
| compute_new_centroids_resources | Optional[Dict] | Resources to request when performing centroid computation. | None |
| assign_points_and_partial_new_centroids_resources | Optional[Dict] | Resources to request when performing the computation of partial centroids. | None |
| write_centroids_resources | Optional[Dict] | Resources to request when performing the write of centroids. | None |
| partial_index_resources | Optional[Dict] | Resources to request when performing the computation of partial indexing. | None |
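
The interaction of `include`, `exclude`, `suffixes`, and `max_files` can be illustrated with a small standalone sketch. It assumes shell-style (`fnmatch`) pattern semantics applied to paths relative to `search_uri`, and an assumed order of filtering; the helper `select_files` is hypothetical and is not part of the library, whose actual matching rules may differ.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath
from typing import List, Optional, Sequence


def select_files(
    paths: Sequence[str],
    include: str = "*",
    exclude: Optional[Sequence[str]] = ("[.]*", "*/[.]*"),
    suffixes: Optional[Sequence[str]] = None,
    max_files: Optional[int] = None,
) -> List[str]:
    """Hypothetical helper mimicking the documented file-selection parameters.

    Assumption: patterns follow fnmatch rules, where '*' also matches '/'.
    """
    selected = []
    for path in paths:
        # Keep only paths matching the include pattern.
        if not fnmatchcase(path, include):
            continue
        # Drop paths matching any exclude pattern (hidden files by default).
        if exclude and any(fnmatchcase(path, pat) for pat in exclude):
            continue
        # Suffixes include the leading dot, e.g. ".txt".
        if suffixes is not None and PurePosixPath(path).suffix not in suffixes:
            continue
        selected.append(path)
    # Cap the number of files after all filters have been applied.
    if max_files is not None:
        selected = selected[:max_files]
    return selected


paths = [
    "a.txt",
    ".hidden",
    "docs/readme.md",
    "docs/.git",
    "docs/.cache/x.txt",
    "notes/b.txt",
]
visible = select_files(paths)
only_txt = select_files(paths, suffixes=[".txt"])
first_txt = select_files(paths, suffixes=[".txt"], max_files=1)
```

Under these assumptions, the default `exclude` patterns drop both top-level hidden entries and anything beneath a hidden directory, while `suffixes` narrows the remaining set by extension.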
Returns
| Name | Type | Description |
|---|---|---|
| | | The resulting TaskGraph’s server UUID. |
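
A minimal usage sketch for `ingest_files` follows. It assumes the module is importable as `tiledb.cloud.files.indexing` (inferred from the `cloud.files.indexing` path on this page) and that valid TileDB Cloud credentials are already configured. The helper names `build_ingest_kwargs` and `run_ingestion`, and all URIs in the usage note, are hypothetical placeholders.

```python
from typing import Any, Dict, Optional, Sequence


def build_ingest_kwargs(
    search_uri: str,
    index_uri: str,
    *,
    namespace: Optional[str] = None,
    acn: Optional[str] = None,
    suffixes: Optional[Sequence[str]] = None,
    max_files: Optional[int] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: collect ingest_files arguments, dropping unset ones."""
    if suffixes is not None:
        # The parameter table states that suffixes must include the dot.
        bad = [s for s in suffixes if not s.startswith(".")]
        if bad:
            raise ValueError(f"suffixes must include the dot: {bad}")
    options = {
        "namespace": namespace,
        "acn": acn,
        "suffixes": suffixes,
        "max_files": max_files,
    }
    kwargs: Dict[str, Any] = {"search_uri": search_uri, "index_uri": index_uri}
    # Leave unset options out so the library defaults apply.
    kwargs.update({k: v for k, v in options.items() if v is not None})
    return kwargs


def run_ingestion(kwargs: Dict[str, Any]) -> Any:
    """Submit the ingestion; returns whatever ingest_files returns
    (documented above as the resulting TaskGraph's server UUID)."""
    # Imported lazily so the helper above works without the package installed.
    # Assumption: this import path matches the installed tiledb-cloud package.
    from tiledb.cloud.files.indexing import ingest_files

    return ingest_files(**kwargs)
```

For example, `run_ingestion(build_ingest_kwargs("s3://my-bucket/docs/", "tiledb://my-namespace/my-index", suffixes=[".txt"], max_files=100))` would submit an ingestion limited to at most 100 `.txt` files, leaving every other parameter at its default.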