Builds a `PineconeIndex` containing a Pinecone context (API key, index and namespace).
The index stores the document chunks and their embeddings (and potentially other information).

The function processes each file or document (depending on `chunker`), splits its content into chunks, embeds these chunks
and then combines this information into a retrievable index. The chunks and embeddings are upserted to Pinecone using
the provided Pinecone context (unless the `upsert` flag is set to `false`).

# Arguments
- `indexer::PineconeIndexer`: The indexing logic for Pinecone operations.
- `files_or_docs`: A vector of valid file paths or documents (depending on `chunker`) to be indexed (chunked and embedded).
- `metadata::Vector{Dict{String, Any}}`: A vector of metadata associated with each file or document, given as dictionaries with `String` keys. Default is an empty vector.
- `pinecone_context::Pinecone.PineconeContextv3`: The Pinecone context (holding the API key) created with Pinecone.jl. Must be specified.
- `pinecone_index::Pinecone.PineconeIndexv3`: The Pinecone index created with Pinecone.jl. Must be specified.
- `pinecone_namespace::AbstractString`: The Pinecone namespace associated with `pinecone_index`.
- `upsert::Bool = true`: A flag specifying whether to upsert the chunks and embeddings to Pinecone. Defaults to `true`.
- `verbose`: An integer specifying the verbosity of the logs. Default is `1` (high-level logging). `0` disables logging.
- `index_id`: A unique identifier for the index. Default is a generated symbol.
- `chunker`: The chunker logic to use for splitting the documents. Default is `TextChunker()`.
- `chunker_kwargs`: Parameters to be provided to the `get_chunks` function. Useful to change the `separators` or `max_length`.
  - `sources`: A vector of strings indicating the source of each chunk. Default is equal to `files_or_docs`.
- `embedder`: The embedder logic to use for embedding the chunks. Default is `BatchEmbedder()`.
- `embedder_kwargs`: Parameters to be provided to the `get_embeddings` function. Useful to change the `target_batch_size_length` or to reduce the number of `asyncmap` tasks via `ntasks`.
  - `model`: The model to use for embedding. Default is `PT.MODEL_EMBEDDING`.
- `tagger`: The tagger logic to use for extracting tags from the chunks. Default is `NoTagger()`, i.e., tag extraction is skipped. There are also `PassthroughTagger` and `OpenTagger`.
- `tagger_kwargs`: Parameters to be provided to the `get_tags` function.
  - `model`: The model to use for tag extraction. Default is `PT.MODEL_CHAT`.
  - `template`: A template to be used for tag extraction. Default is `:RAGExtractMetadataShort`.
  - `tags`: A vector of vectors of strings directly providing the tags for each chunk. Applicable for `tagger::PassthroughTagger`.
- `api_kwargs`: Parameters to be provided to the API endpoint. Shared across all API calls if provided.
- `cost_tracker`: A `Threads.Atomic{Float64}` object to track the total cost of the API calls. Useful to pass the total cost to the parent call.

# Returns
- `PineconeIndex`: An object containing the compiled index of chunks, embeddings, tags, vocabulary, sources and metadata, together with the Pinecone connection data.
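
The flow described above (chunk, embed, then optionally upsert) can be illustrated with a minimal, self-contained sketch. It does not call Pinecone.jl or the real `build_index`: `toy_chunks`, `toy_embed`, `toy_build_index` and the in-memory `remote_store` are hypothetical stand-ins introduced only for illustration, and the real chunker, embedder and upsert logic may well differ.

```julia
# Hypothetical stand-ins for the chunker, the embedder and the Pinecone upsert.
# None of these names exist in Pinecone.jl or in the package documented above.

# Split a document into chunks of at most `max_length` characters.
function toy_chunks(doc::AbstractString; max_length::Int = 20)
    chars = collect(doc)
    n = length(chars)
    return [String(chars[i:min(i + max_length - 1, n)]) for i in 1:max_length:n]
end

# Map a chunk to a tiny 2-dimensional "embedding": its length and its vowel count.
toy_embed(chunk::AbstractString) =
    Float32[length(chunk), count(c -> c in "aeiou", chunk)]

# Chunk and embed every document, then write the embeddings to `remote_store`
# unless `upsert` is `false` (mirroring the role of the `upsert` flag).
function toy_build_index(docs::Vector{String};
        sources = docs,
        upsert::Bool = true,
        remote_store::Dict{String, Vector{Float32}} = Dict{String, Vector{Float32}}(),
        max_length::Int = 20)
    chunks = String[]
    chunk_sources = String[]
    for (doc, src) in zip(docs, sources)
        doc_chunks = toy_chunks(doc; max_length)
        append!(chunks, doc_chunks)
        # Every chunk remembers which document it came from.
        append!(chunk_sources, fill(src, length(doc_chunks)))
    end
    embeddings = [toy_embed(c) for c in chunks]
    if upsert
        for (i, emb) in enumerate(embeddings)
            remote_store["chunk-$i"] = emb
        end
    end
    return (; chunks, embeddings, sources = chunk_sources, remote_store)
end

docs = ["The quick brown fox jumps over the lazy dog.", "Hello Pinecone!"]
store = Dict{String, Vector{Float32}}()

# With `upsert = true` (the default), the store is filled with one entry per chunk.
index = toy_build_index(docs; sources = ["a.txt", "b.txt"], remote_store = store)

# With `upsert = false`, the index is built locally and nothing is written.
dry_run = toy_build_index(docs; upsert = false)
```

In this sketch the local result is identical with or without the upsert step; only the side effect on the store changes, which is one plausible reading of how the `upsert` flag behaves.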