Contains Retrieval Augmented Generation related utilities for Azure Machine Learning and OSS interoperability.
Project description
AzureML Retrieval Augmented Generation Utilities
This package is currently in alpha. Use it at your own risk: expect breaking changes and unstable behavior.
It contains utilities for:
- Processing text documents into chunks appropriate for use in LLM prompts, with metadata such as the source URL.
- Embedding chunks with OpenAI or HuggingFace embeddings models, including the ability to update a set of embeddings over time.
- Creating MLIndex artifacts from embeddings. An MLIndex is a YAML file capturing the metadata needed to deserialize different kinds of Vector Indexes for use in langchain. Supported Index types:
  - FAISS index (via langchain)
  - Azure Cognitive Search index
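As a rough illustration of the first item, the sketch below splits text into overlapping chunks and attaches the source URL as metadata. It is not the package's implementation: `Chunk`, `chunk_text`, the `source_url` and `start` metadata keys, and the character-based sizes are all hypothetical choices made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """A piece of a document plus metadata (hypothetical type for this sketch)."""
    content: str
    metadata: dict = field(default_factory=dict)


def chunk_text(text: str, source_url: str, chunk_size: int = 200, overlap: int = 20) -> list:
    """Split text into overlapping character windows, tagging each with its source."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        chunks.append(Chunk(piece, {"source_url": source_url, "start": start}))
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


chunks = chunk_text("a" * 450, "https://example.com/doc")
print(len(chunks), chunks[1].metadata)
```

The overlap keeps a little shared context between neighbouring chunks, so a sentence cut at a boundary still appears whole in at least one of them.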
Getting started
You can install AzureML's RAG package using pip.
pip install azureml-rag
There are various extra installs you probably want to include based on intended use:
- faiss: When using FAISS based Vector Indexes
- cognitive_search: When using Azure Cognitive Search Indexes
- hugging_face: When using Sentence Transformer embedding models from HuggingFace (local inference)
- document_parsing: When cracking and chunking documents locally to put in an Index
MLIndex
An MLIndex file is a YAML description of an index of data and embeddings, together with the embeddings model used to build it.
embeddings:
  dimension: 768
  kind: hugging_face
  model: sentence-transformers/all-mpnet-base-v2
  schema_version: '2'
index:
  api_version: 2021-04-30-Preview
  connection:
    id: /subscriptions/<subscription_id>/resourceGroups/<resource_group>/providers/Microsoft.MachineLearningServices/workspaces/<workspace>/connections/<acs_connection_name>
  connection_type: workspace_connection
  endpoint: https://<acs_name>.search.windows.net
  engine: azure-sdk
  field_mapping:
    content: content
    filename: sourcefile
    metadata: meta_json_string
    title: title
    url: sourcepage
    embedding: content_vector_hugging_face
  index: azureml-rag-test-206e03b6-3880-407b-9bc4-c0a1162d6c70
  kind: acs
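One way to read `field_mapping` is as a lookup from the logical names used by a retriever (`content`, `url`, and so on) to the field names in the underlying search index. The sketch below applies that reading to a raw search hit. `apply_field_mapping` and the sample hit are hypothetical, and the direction of the mapping is an assumption made for this example rather than documented behavior.

```python
# Mirrors the field_mapping block from the MLIndex example above.
FIELD_MAPPING = {
    "content": "content",
    "filename": "sourcefile",
    "metadata": "meta_json_string",
    "title": "title",
    "url": "sourcepage",
    "embedding": "content_vector_hugging_face",
}


def apply_field_mapping(hit: dict, field_mapping: dict) -> dict:
    """Rename index-specific fields in a search hit to their logical names.

    Fields missing from the hit are skipped rather than raising.
    """
    return {
        logical: hit[index_field]
        for logical, index_field in field_mapping.items()
        if index_field in hit
    }


# Hypothetical search hit, shaped like a document stored in the index.
sample_hit = {
    "content": "Example chunk text about Compute Instances.",
    "sourcefile": "compute-instance.md",
    "sourcepage": "https://example.com/compute-instance",
    "title": "Compute Instance",
}
doc = apply_field_mapping(sample_hit, FIELD_MAPPING)
print(doc["url"])
```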
Create MLIndex
Examples using MLIndex remotely with AzureML and locally with langchain live here: https://github.com/Azure/azureml-examples/tree/main/sdk/python/generative-ai/rag
Consume MLIndex
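from azureml.rag.mlindex import MLIndex

# uri_to_folder_with_mlindex: path or URI of the folder that contains the MLIndex file
retriever = MLIndex(uri_to_folder_with_mlindex).as_langchain_retriever()
# Returns the chunks most relevant to the query
retriever.get_relevant_documents('What is an AzureML Compute Instance?')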
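The retriever returns langchain documents, which expose `page_content` and `metadata`. A common next step is to fold them into a prompt, and the sketch below shows one possible way to do that. `format_context`, `FakeDocument`, and the `source` metadata key are hypothetical. The stand-in class exists only so the example runs without an index.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Stand-in with the same attributes as a langchain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_context(docs, max_chars: int = 2000) -> str:
    """Join retrieved chunks into one context string within a rough character budget."""
    parts = []
    used = 0
    for i, doc in enumerate(docs, start=1):
        source = doc.metadata.get("source", "unknown")
        entry = f"[{i}] (source: {source})\n{doc.page_content}"
        if used + len(entry) > max_chars:
            break  # adding this entry would exceed the budget
        parts.append(entry)
        used += len(entry)
    return "\n\n".join(parts)


docs = [
    FakeDocument("Example text about Compute Instances.", {"source": "ci.md"}),
    FakeDocument("Example text about Compute Clusters."),
]
print(format_context(docs))
```

In real use, the list returned by `retriever.get_relevant_documents(...)` would take the place of `docs`.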
Changelog
0.2.3
- Fix git clone url format bug
0.2.2
- Fix all langchain splitters to use tiktoken in an airgap-friendly way.
0.2.1
- Introduce DataIndex interface for scheduling Vector Index Pipeline in AzureML and creating MLIndex Assets
- Vendor various langchain components to avoid breaking changes to MLIndex internal logic
0.1.24.2
- Fix all langchain splitters to use tiktoken in an airgap-friendly way.
0.1.24.1
- Fix subsplitter init bug in MarkdownHeaderSplitter
- Support getting langchain retriever for ACS based MLIndex with embeddings.kind: none.
0.1.24
- Don't mlflow log unless there's an active mlflow run.
- Support langchain.vectorstores.azuresearch after langchain>=0.0.273 upgraded to azure-search-documents==11.4.0b8
- Use tiktoken encodings from package for other splitter types
0.1.23.2
- Handle Path objects passed into MLIndex init.
0.1.23.1
- Handle .api.cognitive style aoai endpoints correctly
0.1.23
- Ensure tiktoken encodings are packaged in wheel
0.1.22
- Set environment variables to pull encodings files from directory with cache key to avoid tiktoken external network call
- Fix mlflow log error when there's no files input
0.1.21
- Fix top level imports in update_acs task failing without helpful reason when old azure-search-documents is installed.
0.1.20
- Fix Crack'n'Chunk race-condition where same named files would overwrite each other.
0.1.19
- Various bug fixes:
  - Handle some malformed git urls in git_clone task
  - Try fall back when parsing csv with pandas fails
  - Allow chunking special tokens
  - Ensure logging with mlflow can't fail a task
- Update to support latest azure-search-documents==11.4.0b8
0.1.18
- Add FaissAndDocStore and FileBasedDocStore, which closely mirror langchain's FAISS and InMemoryDocStore without the langchain or pickle dependency. These are not used by default until PromptFlow support has been added.
- Pin azure-search-documents==11.4.0b6 as there are breaking changes in 11.4.0b7 and 11.4.0b8
0.1.17
- Update interactions with Azure Cognitive Search to use the latest azure-search-documents SDK
0.1.16
- Convert api_type from Workspace Connections to lower case to appease langchain's case-sensitive checking.
0.1.15
- Add support for custom loaders
- Added logging for MLIndex.__init__ to understand usage of MLIndex
0.1.14
- Add Support for CustomKeys connections
- Add OpenAI support for QA Gen and Embeddings
0.1.13 (2023-07-12)
- Implement single node non-PRS embed task to enable clearer logs for users.
0.1.12 (2023-06-29)
- Fix casing check of ApiVersion, ApiType in infer_deployment util
0.1.11 (2023-06-28)
- Update casing check for workspace connection ApiVersion, ApiType
- int casting for temperature, max_tokens
0.1.10 (2023-06-26)
- Update data asset registering to have adjustable output_type
- Remove asset registering from generate_qa.py
0.1.9 (2023-06-22)
- Add azureml.rag.data_generation module.
- Fixed bug that would cause crack_and_chunk to fail for documents that contain non-utf-8 characters. Currently these characters will be ignored.
- Improved heading extraction from Markdown files. When use_rcts=False, Markdown files will be split on headings and each chunk will have the heading context up to the root as a prefix (e.g. # Heading 1\n## Heading 2\n# Heading 3\n{content})
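The heading-prefix behavior described in this entry can be sketched as follows. This is a simplified illustration, not the package's splitter: `split_markdown_by_heading` is a hypothetical name, and the sketch ignores edge cases such as `#` characters inside fenced code blocks.

```python
import re

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_markdown_by_heading(text: str) -> list:
    """Split Markdown on headings and prefix each chunk with its heading path."""
    chunks = []
    path = []  # (level, heading line) pairs from the root to the current heading
    body = []  # content lines collected under the current heading

    def flush():
        content = "\n".join(body).strip()
        if content:
            prefix = "\n".join(line for _, line in path)
            chunks.append(f"{prefix}\n{content}" if prefix else content)
        body.clear()

    for line in text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or a deeper level before descending.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, line.strip()))
        else:
            body.append(line)
    flush()
    return chunks


sample = "# Guide\nIntro.\n## Setup\nInstall it.\n## Usage\nRun it."
for chunk in split_markdown_by_heading(sample):
    print(repr(chunk))
```

Carrying the heading path into every chunk means a chunk retrieved on its own still says which section of the document it came from.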
0.1.8 (2023-06-21)
- Add deployment inferring util for use in azureml-insider notebooks.
0.1.7 (2023-06-08)
- Improved telemetry for tasks (used in RAG Pipeline Components)
0.1.6 (2023-05-31)
- Fail crack_and_chunk task when no files were processed (usually because of a malformed input_glob)
- Change update_acs.py to default push_embeddings=True instead of False.
0.1.5 (2023-05-19)
- Add api_base back to MLIndex embeddings config for back-compat (until all clients start getting it from Workspace Connection).
- Add telemetry for tasks used in pipeline components, not enabled by default for SDK usage.
0.1.4 (2023-05-17)
- Fix bug where enabling rcts option on split_documents used nltk splitter instead.
0.1.3 (2023-05-12)
- Support Workspace Connection based auth for Git, Azure OpenAI and Azure Cognitive Search usage.
0.1.2 (2023-05-05)
- Refactored document chunking to allow insertion of custom processing logic
0.0.1 (2023-04-25)
Features Added
- Introduced package
- langchain Retriever for Azure Cognitive Search
Built Distribution
File details
Details for the file azureml_rag-0.2.3-py3-none-any.whl.
File metadata
- Download URL: azureml_rag-0.2.3-py3-none-any.whl
- Upload date:
- Size: 1.6 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.9.6 requests/2.31.0 setuptools/50.3.2 requests-toolbelt/1.0.0 tqdm/4.66.1 CPython/3.8.13
File hashes
Algorithm | Hash digest
---|---
SHA256 | 7c19f9afd33553d7090819eba0a0719a5f7e1baa90ef6149668276ea00799a86
MD5 | f22a5bacbe65d42377af02bb9641db23
BLAKE2b-256 | d0916685e732506fdb3cef7ef15166a198ac8c7041a232949e74bf4d13d5d907