files.indexing
cloud.files.indexing
Functions
Name | Description |
---|---|
create_dataset_udf | Create a TileDB vector search dataset. |
index_files_udf | Ingest files into a vector search text index. |
ingest_files | Ingest files into a vector search text index. |
create_dataset_udf
cloud.files.indexing.create_dataset_udf(
    search_uri,
    index_uri,
    *,
    config=None,
    environment_variables=None,
    verbose=False,
    index_type=IndexTypes.IVF_FLAT,
    index_creation_kwargs=None,
    pattern='*',
    ignore=('[.]*', '*/[.]*'),
    suffixes=None,
    max_files=None,
    text_splitter='RecursiveCharacterTextSplitter',
    text_splitter_kwargs=None,
    embedding_class='LangChainEmbedding',
    embedding_kwargs=None,
)
Create a TileDB vector search dataset.
index_files_udf
cloud.files.indexing.index_files_udf(
    index_uri,
    *,
    acn=None,
    config=None,
    environment_variables=None,
    openai_key=None,
    namespace=None,
    verbose=False,
    trace_id=None,
    index_timestamp=None,
    workers=-1,
    worker_resources=None,
    worker_image=None,
    extra_worker_modules=None,
    driver_resources=None,
    driver_image=None,
    extra_driver_modules=None,
    max_tasks_per_stage=-1,
    embeddings_generation_mode=dag.Mode.BATCH,
    embeddings_generation_driver_mode=dag.Mode.BATCH,
    vector_indexing_mode=dag.Mode.BATCH,
    index_update_kwargs=None,
)
Ingest files into a vector search text index.
ingest_files
cloud.files.indexing.ingest_files(
    search_uri,
    index_uri,
    *,
    acn=None,
    config=None,
    environment_variables=None,
    namespace=None,
    verbose=False,
    trace_id=None,
    index_type=IndexTypes.IVF_FLAT,
    index_creation_kwargs=None,
    index_dag_resources=dag.MIN_BATCH_RESOURCES,
    include='*',
    exclude=('[.]*', '*/[.]*'),
    suffixes=None,
    max_files=None,
    text_splitter='RecursiveCharacterTextSplitter',
    text_splitter_kwargs=None,
    embedding_class='LangChainEmbedding',
    embedding_kwargs=None,
    openai_key=None,
    index_timestamp=None,
    workers=-1,
    worker_resources=None,
    worker_image=None,
    extra_worker_modules=None,
    driver_resources=None,
    driver_image=None,
    extra_driver_modules=None,
    max_tasks_per_stage=-1,
    embeddings_generation_mode=dag.Mode.BATCH,
    embeddings_generation_driver_mode=dag.Mode.BATCH,
    vector_indexing_mode=dag.Mode.BATCH,
    index_update_kwargs=None,
    threads='16',
    ingest_resources=None,
    consolidate_partition_resources=None,
    copy_centroids_resources=None,
    random_sample_resources=None,
    kmeans_resources=None,
    compute_new_centroids_resources=None,
    assign_points_and_partial_new_centroids_resources=None,
    write_centroids_resources=None,
    partial_index_resources=None,
)
Ingest files into a vector search text index.
Parameters
Name | Type | Description | Default |
---|---|---|---|
search_uri | str | URI to load files from. This can be a directory URI or a FileStore file URI. | required |
index_uri | str | URI of the vector index to load files to. | required |
acn | Optional[str] | Access Credentials Name (ACN) registered in TileDB Cloud (ARN type), defaults to None. | None |
config | Optional[dict] | config dictionary, defaults to None. | None |
environment_variables | Optional[Mapping[str, str]] | Environment variables to use during ingestion. | None |
namespace | Optional[str] | TileDB-Cloud namespace, defaults to None. | None |
verbose | bool | verbose logging, defaults to False. | False |
trace_id | Optional[str] | Trace ID for logging, defaults to None. | None |
index_type | IndexTypes | Vector search index type (“FLAT”, “IVF_FLAT”). | IndexTypes.IVF_FLAT |
index_creation_kwargs | Optional[Dict] | Arguments to be passed to the index creation method. | None |
index_dag_resources | Optional[Mapping[str, Any]] | Index creation Node Specs configuration. | dag.MIN_BATCH_RESOURCES |
include | str | File pattern to include, relative to search_uri. By default, all files are included. | '*' |
exclude | Optional[Sequence[str]] | File patterns to exclude, relative to search_uri. By default, all hidden files are ignored. | ('[.]*', '*/[.]*') |
suffixes | Optional[Sequence[str]] | If provided, only files with these suffixes are kept. Useful when files with several different suffixes should be kept. Suffixes must include the dot, e.g. “.txt”. | None |
max_files | Optional[int] | Maximum number of files to include. | None |
text_splitter_kwargs | Optional[Dict] | Arguments for the splitter class. | None |
index_timestamp | Optional[int] | Timestamp to add index updates at. | None |
workers | int | If embeddings_generation_mode=BATCH this is the number of distributed workers to be used. | -1 |
worker_resources | Optional[Dict] | If embeddings_generation_mode=BATCH this can be used to specify the worker resources. | None |
worker_image | Optional[str] | If embeddings_generation_mode=BATCH this can be used to specify the worker Docker image. | None |
extra_worker_modules | Optional[List[str]] | If embeddings_generation_mode=BATCH this can be used to install extra pip packages to the image. | None |
driver_resources | Optional[Dict] | If embeddings_generation_driver_mode=BATCH this can be used to specify the driver resources. | None |
driver_image | Optional[str] | If embeddings_generation_driver_mode=BATCH this can be used to specify the driver Docker image. | None |
extra_driver_modules | Optional[List[str]] | If embeddings_generation_driver_mode=BATCH this can be used to install extra pip packages to the image. | None |
max_tasks_per_stage | int | Maximum number of UDF tasks per computation stage. | -1 |
embeddings_generation_mode | dag.Mode | TaskGraph execution mode for embeddings generation. | dag.Mode.BATCH |
embeddings_generation_driver_mode | dag.Mode | TaskGraph execution mode for the ingestion driver. | dag.Mode.BATCH |
vector_indexing_mode | dag.Mode | TaskGraph execution mode for the vector indexing. | dag.Mode.BATCH |
index_update_kwargs | Optional[Dict] | Extra arguments to pass to the index update job. These can be any of the documented arguments of the tiledb.vector_search.ingest method (https://tiledb-inc.github.io/TileDB-Vector-Search/documentation/reference/ingestion.html#tiledb.vector_search.ingestion.ingest), with the exception of the Vector Search BATCH embedding resources, which are set through the parameters listed after this one. files_per_partition: int can also be included (defaults to -1). The BATCH embedding resource parameters are only applicable if the index update is executed in BATCH mode. | None |
threads | str | Threads to be used in the Nodes, defaults to 16. | '16' |
ingest_resources | Optional[Dict] | Resources to request when performing vector ingestion. | None |
consolidate_partition_resources | Optional[Dict] | Resources to request when performing consolidation of a partition. | None |
copy_centroids_resources | Optional[Dict] | Resources to request when performing copy of centroids from input array to output array. | None |
random_sample_resources | Optional[Dict] | Resources to request when performing random sample selection. | None |
kmeans_resources | Optional[Dict] | Resources to request when performing kmeans task. | None |
compute_new_centroids_resources | Optional[Dict] | Resources to request when performing centroid computation. | None |
assign_points_and_partial_new_centroids_resources | Optional[Dict] | Resources to request when performing the computation of partial centroids. | None |
write_centroids_resources | Optional[Dict] | Resources to request when performing the write of centroids. | None |
partial_index_resources | Optional[Dict] | Resources to request when performing the computation of partial indexing. | None |
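
The interaction of include, exclude, suffixes, and max_files can be easier to follow with a small sketch. The helper below is hypothetical and is not the library's implementation: it assumes the patterns are shell-style globs matched against paths relative to search_uri, and it uses Python's standard fnmatch module to approximate that. The actual matching rules of the file reader may differ.

```python
from fnmatch import fnmatchcase
from typing import List, Optional, Sequence


def select_files(
    paths: Sequence[str],
    include: str = "*",
    exclude: Optional[Sequence[str]] = ("[.]*", "*/[.]*"),
    suffixes: Optional[Sequence[str]] = None,
    max_files: Optional[int] = None,
) -> List[str]:
    """Hypothetical helper: filter paths given relative to search_uri.

    Assumes glob-style matching; defaults mirror the parameter table.
    """
    selected = []
    for path in paths:
        # Keep only paths matching the include pattern.
        if not fnmatchcase(path, include):
            continue
        # Drop paths matching any exclude pattern (hidden files by default).
        if exclude and any(fnmatchcase(path, pat) for pat in exclude):
            continue
        # Suffixes must include the dot, e.g. ".txt".
        if suffixes and not path.endswith(tuple(suffixes)):
            continue
        selected.append(path)
    # Cap the result when max_files is set.
    return selected if max_files is None else selected[:max_files]


paths = ["a.txt", ".hidden", "docs/.git", "docs/b.md"]
print(select_files(paths))                     # ['a.txt', 'docs/b.md']
print(select_files(paths, suffixes=[".txt"]))  # ['a.txt']
```

With the defaults, the two exclude patterns drop a hidden entry at the top level and a hidden entry one or more directories down, which is why both are needed.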
Returns
Name | Type | Description |
---|---|---|
|  |  | The resulting TaskGraph’s server UUID. |
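
As an illustration of how a call might be assembled, here is a minimal sketch. It assumes the module is importable as tiledb.cloud.files.indexing and that a TileDB Cloud session is already authenticated. The URIs and the namespace are hypothetical placeholders, and the helper names build_ingest_kwargs and run_ingestion are introduced only for this example. The actual call is left commented out because it submits a task graph to TileDB Cloud.

```python
from typing import Any, Dict


def build_ingest_kwargs(search_uri: str, index_uri: str, namespace: str) -> Dict[str, Any]:
    """Hypothetical helper: collect ingest_files arguments from the parameter table."""
    return {
        "search_uri": search_uri,      # directory URI or FileStore file URI
        "index_uri": index_uri,        # vector index to load files to
        "namespace": namespace,
        "suffixes": [".txt", ".md"],   # suffixes must include the dot
        "max_files": 100,
        "workers": -1,                 # documented default
        "verbose": True,
    }


def run_ingestion(kwargs: Dict[str, Any]):
    """Submit the ingestion; returns the resulting TaskGraph's server UUID."""
    # Imported lazily so the sketch can be read without the package installed.
    from tiledb.cloud.files.indexing import ingest_files

    return ingest_files(**kwargs)


kwargs = build_ingest_kwargs(
    "s3://my-bucket/documents/",            # hypothetical search URI
    "tiledb://my_namespace/my_text_index",  # hypothetical index URI
    "my_namespace",                         # hypothetical namespace
)
# task_graph_uuid = run_ingestion(kwargs)  # requires TileDB Cloud credentials
```

Every parameter after index_uri is keyword-only, so collecting the arguments in a dictionary and unpacking it keeps the call site short.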