Vannanaaina/oss di to cu migration tool #29

Open · wants to merge 53 commits into main from vannanaaina/oss-di-to-cu-migration-tool
Commits (53)
05f90e2
added di to cu migration tool code
May 14, 2025
47eba35
Update README.md - need to finish things of note
aainav269 May 14, 2025
c603fc7
Update README.md
aainav269 May 14, 2025
22abdc6
Update README.md
aainav269 May 14, 2025
a113da4
added sample documents
May 14, 2025
ec315f4
removed new data
May 14, 2025
c49a88a
changed analyze to use key and not id
May 14, 2025
bd8af1d
resolving PR comments
May 14, 2025
4ba72c7
changed outputFile to output_file
May 15, 2025
6a3ebbb
removed content understanding from host, as customers might not know …
May 16, 2025
c770f1f
resolved Tim's comments
May 19, 2025
3739db2
resolving Thomas' comments
May 22, 2025
47a6126
made changes to implementation per Paul's comments
May 22, 2025
9df1d42
Update README.md
aainav269 May 22, 2025
a3b40a0
Update README.md
aainav269 May 22, 2025
2178230
analyzer_result.json is a CLI variable
May 22, 2025
8b81533
Merge branch 'vannanaaina/oss-di-to-cu-migration-tool' of https://git…
May 22, 2025
aef0569
Update README.md
aainav269 May 23, 2025
ab35592
Update README.md
aainav269 May 23, 2025
b1b92eb
Update README.md
aainav269 May 23, 2025
2f0c5b7
Update README.md
aainav269 May 23, 2025
724bb36
Update README.md
aainav269 May 23, 2025
797418c
Update README.md
aainav269 May 28, 2025
6a0381c
added assets
May 28, 2025
420f9d3
Update README.md
aainav269 May 28, 2025
73c7281
updated endpoint
May 28, 2025
dd918b3
Update README.md
aainav269 May 28, 2025
e7eb604
updated endpoint
May 28, 2025
cfe7f06
Update README.md
aainav269 May 28, 2025
b57d76e
added in other images
May 28, 2025
114416b
Update README.md
aainav269 May 28, 2025
d277e17
Update README.md
aainav269 May 28, 2025
c4bf717
Update README.md
aainav269 May 28, 2025
62b9642
updated to show no subscription id
May 28, 2025
893efe5
Update README.md
aainav269 May 28, 2025
59f9448
using analyzer for individual file now
May 28, 2025
3b2ed72
Merge branch 'vannanaaina/oss-di-to-cu-migration-tool' of https://git…
May 28, 2025
3282aab
Update README.md
aainav269 May 28, 2025
01351fe
Update README.md
aainav269 May 28, 2025
456b56f
Update README.md
aainav269 May 28, 2025
9cd0b1d
adjusting storage browser
May 28, 2025
0cfb83d
Update README.md
aainav269 May 30, 2025
5404072
Update README.md
aainav269 May 30, 2025
085f6ee
updated endpoint
Jun 2, 2025
1b8a5e9
Update README.md
aainav269 Jun 2, 2025
00124a1
updated generate sas
Jun 2, 2025
3894dd3
Merge branch 'vannanaaina/oss-di-to-cu-migration-tool' of https://git…
Jun 2, 2025
bd8d1c9
Update README.md
aainav269 Jun 2, 2025
f3a63e8
Update README.md
aainav269 Jun 2, 2025
a4dd8b2
Update README.md
aainav269 Jun 2, 2025
0b38d3b
Update README.md
aainav269 Jun 2, 2025
99f80ca
Update README.md
aainav269 Jun 2, 2025
920d13b
Update README.md
aainav269 Jun 3, 2025
6 changes: 6 additions & 0 deletions python/di_to_cu_migration_tool/.sample_env
@@ -0,0 +1,6 @@
# Rename to .env
HOST="<fill in your target endpoint here>"

API_VERSION="2025-05-01-preview"

SUBSCRIPTION_KEY="<fill in your API Key here>" # Your API Key if you have one; alternatively, this can be your Subscription ID
154 changes: 154 additions & 0 deletions python/di_to_cu_migration_tool/README.md
@@ -0,0 +1,154 @@
# Document Intelligence to Content Understanding Migration Tool (Python)

Welcome! We've created this tool to help convert your Document Intelligence (DI) datasets to the Content Understanding (CU) **Preview.2** (2025-05-01-preview) format, as seen in AI Foundry. The following DI versions are supported:
- Custom Extraction Model DI 3.1 GA (2023-07-31) to DI 4.0 GA (2024-11-30) (seen in Document Intelligence Studio) --> DI-version = neural
- Document Field Extraction Model 4.0 Preview (2024-07-31-preview) (seen in AI Foundry/AI Services/Vision + Document/Document Field Extraction) --> DI-version = generative

To identify which version of Document Intelligence your dataset uses, please consult the sample documents provided under this folder and determine which format matches yours. You can also identify the version through your DI project's UX. For instance, Custom Extraction DI 3.1/4.0 GA is part of Document Intelligence Studio (i.e., https://documentintelligence.ai.azure.com/studio), while Document Field Extraction DI 4.0 Preview is only available on Azure AI Foundry as a preview service (i.e., https://ai.azure.com/explore/aiservices/vision/document/extraction).

To migrate from these DI versions to Content Understanding Preview.2, this tool first converts the DI dataset to a CU-compatible format. Once converted, you have the option to create a Content Understanding analyzer, which will be trained on the converted CU dataset. You can then test this analyzer to verify its quality.

## Details About the Tools
To provide further detail, here is a breakdown of each of the 3 CLI tools and their capabilities:
* **di_to_cu_converter.py**:
  * This CLI tool performs the first step of migration. It reads your labelled Document Intelligence dataset and converts it into a CU-compatible dataset. Through this tool, we map the following files accordingly: fields.json to analyzer.json, DI labels.json to CU labels.json, and ocr.json to result.json (illustrated below).
  * Depending on the DI version you wish to migrate from, we use [cu_converter_neural.py](cu_converter_neural.py) or [cu_converter_generative.py](cu_converter_generative.py) accordingly to convert your fields.json and labels.json files.
  * For OCR conversion, the tool creates a sample CU analyzer to gather raw OCR results via an Analyze request for each original file in the DI dataset. Since this sample analyzer contains no fields, the resulting result.json files contain no fields either. For more details, please refer to [get_ocr.py](get_ocr.py).
* **create_analyzer.py**:
  * Once the dataset is converted to CU format, this CLI tool creates a CU analyzer from the converted dataset.
* **call_analyze.py**:
  * This CLI tool can be used to verify that the migration completed successfully and to test the quality of the previously created analyzer.
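
As a rough illustration, the conversion maps a DI dataset to a CU dataset as below. File names here are illustrative, and the exact output naming is an assumption based on the mapping described above:

```text
DI dataset (source)                  CU dataset (converted)
├── fields.json               -->    ├── analyzer.json
├── invoice1.pdf              -->    ├── invoice1.pdf
├── invoice1.pdf.labels.json  -->    ├── invoice1.pdf.labels.json   (CU label format)
└── invoice1.pdf.ocr.json     -->    └── invoice1.pdf.result.json
```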

## Setup
To set up this tool, complete the following steps:
1. Install the required dependencies via **pip install -r ./requirements.txt**
2. Rename the file **.sample_env** to **.env**
3. Replace the following values in the **.env** file (a completed example follows this list):
   - **HOST:** Update this to your Azure AI service endpoint.
     - Ex: "https://sample-azure-ai-resource.services.ai.azure.com"
     - Do not include a trailing "/".
     ![Alt text](assets/sample-azure-resource.png "Azure AI Service")
     ![Alt text](assets/endpoint.png "Azure AI Service Endpoints")
   - **SUBSCRIPTION_KEY:** Update this to your Azure AI Service's API Key or Subscription ID, used to identify and authenticate the API request.
     - You can locate your API Key here: ![Alt text](assets/endpoint-with-keys.png "Azure AI Service Endpoints With Keys")
     - If you are using AAD, please refer to your Subscription ID: ![Alt text](assets/subscription-id.png "Azure AI Service Subscription ID")
   - **API_VERSION:** This version ensures that you are converting the dataset to CU Preview.2. No changes are needed here.
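
For reference, a completed **.env** might look like this (the values below are placeholders):

```bash
# Example .env with placeholder values
HOST="https://sample-azure-ai-resource.services.ai.azure.com"
API_VERSION="2025-05-01-preview"
SUBSCRIPTION_KEY="<your-api-key-or-subscription-id>"
```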

## How to Locate Your Document Field Extraction Dataset for Migration
To migrate your Document Field Extraction dataset from AI Foundry, please follow the steps below:
1. On the bottom left of your Document Field Extraction project page, please select "Management Center."
![Alt text](assets/management-center.png "Management Center")
2. Now on the Management Center page, please select "View All" from the Connected Resources section.
![Alt text](assets/connected-resources.png "Connected Resources")
3. Within these resources, look for the resource with type "Azure Blob Storage." This resource's target URL contains the location of your dataset's storage account (in yellow) and blob container (in blue).
![Alt text](assets/manage-connections.png "Manage Connections")
Using these values, navigate to your blob container. Then, select the "labelingProjects" folder. From there, select the folder with the same name as the blob container. Here, you'll locate all the contents of your project in the "data" folder.

For example, the sample Document Field Extraction project is stored at:
![Alt text](assets/azure-portal.png "Azure Portal")

## How to Find Your Source and Target SAS URLs
To run migration, you will need to specify the source SAS URL (location of your Document Intelligence dataset) and target SAS URL (location for your Content Understanding dataset).

To locate the SAS URL for a file or folder for any container URL arguments, please follow these steps:

1. Navigate to your storage account in Azure Portal, and from the left pane, select "Storage Browser."
![Alt text](assets/storage-browser.png "Storage Browser")
2. Select the source/target blob container for either where your DI dataset is present or where your CU dataset will be. Click on the extended menu on the side and select "Generate SAS."
![Alt text](assets/generate-sas.png "Generate SAS")
3. Configure the permissions and expiry for your SAS URL accordingly.

For the DI source dataset, please select these permissions: _**Read & List**_

For the CU target dataset, please select these permissions: _**Read, Add, Create, & Write**_

Once configured, please select "Generate SAS Token and URL" & copy the URL shown under "Blob SAS URL."

![Alt text](assets/generate-sas-pop-up.png "Generate SAS Pop-Up")

Notes:

- Since the SAS URL points to the container rather than a specific folder, please specify the correct dataset folder as --source-blob-folder or --target-blob-folder.
- To get the SAS URL for a single file, navigate to the specific file and repeat the steps above:
![Alt text](assets/individual-file-generate-sas.png "Generate SAS for Individual File")
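
If you prefer to generate SAS URLs programmatically rather than through the portal, below is a minimal sketch using the azure-storage-blob package; the account name, account key, container names, and 8-hour expiry are placeholders:

```python
# Sketch: generate container SAS URLs in code instead of via the Azure Portal.
# ACCOUNT_NAME, ACCOUNT_KEY, and the container names below are placeholders.
from datetime import datetime, timedelta, timezone

from azure.storage.blob import ContainerSasPermissions, generate_container_sas

ACCOUNT_NAME = "sourceStorageAccount"
ACCOUNT_KEY = "<your-storage-account-key>"

def container_sas_url(container: str, permission: ContainerSasPermissions) -> str:
    token = generate_container_sas(
        account_name=ACCOUNT_NAME,
        container_name=container,
        account_key=ACCOUNT_KEY,
        permission=permission,
        expiry=datetime.now(timezone.utc) + timedelta(hours=8),  # placeholder expiry
    )
    return f"https://{ACCOUNT_NAME}.blob.core.windows.net/{container}?{token}"

# DI source dataset: Read & List
source_url = container_sas_url("sourceContainer", ContainerSasPermissions(read=True, list=True))
# CU target dataset: Read, Add, Create & Write
target_url = container_sas_url("targetContainer", ContainerSasPermissions(read=True, add=True, create=True, write=True))
```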

## How to Run
To run the 3 tools, please refer to the following commands. For readability, each command is split across lines with backslash continuations; you can run them as shown or join them onto a single line.

_**NOTE:** Wrap URLs in double quotes ("") when passing them as command-line arguments._

### 1. Converting Document Intelligence to Content Understanding Dataset

If you are migrating a _DI 3.1/4.0 GA Custom Extraction_ dataset, please run this command:

```bash
python ./di_to_cu_converter.py --DI-version neural --analyzer-prefix mySampleAnalyzer \
  --source-container-sas-url "https://sourceStorageAccount.blob.core.windows.net/sourceContainer?sourceSASToken" \
  --source-blob-folder diDatasetFolderName \
  --target-container-sas-url "https://targetStorageAccount.blob.core.windows.net/targetContainer?targetSASToken" \
  --target-blob-folder cuDatasetFolderName
```

For migration of Custom Extraction DI 3.1/4.0 GA, specifying an analyzer prefix is required for creating a CU analyzer. Since the fields.json defines no "doc_type" for identification, the created analyzer's ID will be the specified analyzer prefix.

If you are migrating a _DI 4.0 Preview Document Field Extraction_ dataset, please run this command:

```bash
python ./di_to_cu_converter.py --DI-version generative --analyzer-prefix mySampleAnalyzer \
  --source-container-sas-url "https://sourceStorageAccount.blob.core.windows.net/sourceContainer?sourceSASToken" \
  --source-blob-folder diDatasetFolderName \
  --target-container-sas-url "https://targetStorageAccount.blob.core.windows.net/targetContainer?targetSASToken" \
  --target-blob-folder cuDatasetFolderName
```

For migration of Document Field Extraction DI 4.0 Preview, specifying an analyzer prefix is optional. However, if you wish to create multiple analyzers from the same analyzer.json, please add an analyzer prefix. If provided, the analyzer ID will become analyzer-prefix_doc-type; otherwise, it will simply remain the doc_type from the fields.json.

_**NOTE:** You are only allowed to create one analyzer per analyzer ID._

### 2. Creating an Analyzer

To create an analyzer using the converted CU analyzer.json, please run this command:

```bash
python ./create_analyzer.py \
  --analyzer-sas-url "https://targetStorageAccount.blob.core.windows.net/targetContainer/cuDatasetFolderName/analyzer.json?targetSASToken" \
  --target-container-sas-url "https://targetStorageAccount.blob.core.windows.net/targetContainer?targetSASToken" \
  --target-blob-folder cuDatasetFolderName
```

The analyzer.json file is stored in the specified target blob container and folder. Please get the SAS URL for the analyzer.json file from there.

Additionally, please use the analyzer ID from this output when running the call_analyze.py tool.

Ex:

![Alt text](assets/analyzer.png "Sample Analyzer Creation")
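
Under the hood, analyzer creation is an HTTP PUT of the analyzer definition to the CU analyzers endpoint. Below is a minimal sketch of that request, reusing the same .env configuration; create_analyzer.py handles this for you (including authentication and polling the long-running operation), and sending the converted analyzer.json verbatim as the body is an assumption:

```python
# Sketch only: create_analyzer.py performs the real request. Sending the converted
# analyzer.json verbatim as the request body is an assumption.
import os

import requests
from dotenv import load_dotenv

load_dotenv()
endpoint = (
    f"{os.getenv('HOST')}/contentunderstanding/analyzers/mySampleAnalyzer"
    f"?api-version={os.getenv('API_VERSION')}"
)
headers = {
    "Ocp-Apim-Subscription-Key": os.getenv("SUBSCRIPTION_KEY"),
    "Content-Type": "application/json",
}

with open("analyzer.json") as f:
    response = requests.put(endpoint, data=f.read(), headers=headers)
print(response.status_code, response.headers.get("Operation-Location"))
```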

### 3. Running Analyze

To analyze a specific PDF or original file, please run this command:

```bash
python ./call_analyze.py --analyzer-id mySampleAnalyzer \
  --pdf-sas-url "https://storageAccount.blob.core.windows.net/container/folder/sample.pdf?SASToken" \
  --output-json "./desired-path-to-analyzer-results.json"
```

For the --analyzer-id argument, use the analyzer ID created in the previous step.
Specifying --output-json is optional; the default output location is "./sample_documents/analyzer_result.json".
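
Once analyze succeeds, you can inspect the saved JSON. Here is a minimal sketch; the "result"/"contents"/"fields" layout accessed below is an assumption about the CU Preview.2 response shape:

```python
# Sketch: inspect a saved analyze result. The response layout accessed below
# ("result" -> "contents" -> "fields") is an assumption, not a documented contract.
import json

with open("./sample_documents/analyzer_result.json") as f:
    result = json.load(f)

print("status:", result.get("status"))
for content in result.get("result", {}).get("contents", []):
    for field_name, field_value in content.get("fields", {}).items():
        print(field_name, "->", field_value)
```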

## Possible Issues
These are some issues that you might run into when creating an analyzer or running analyze.
### Creating an Analyzer
For any **400** error, please validate the following:
- You are using a valid endpoint. Example: _https://yourEndpoint/contentunderstanding/analyzers/yourAnalyzerID?api-version=2025-05-01-preview_
- Your converted CU dataset may not meet the latest naming constraints. Ensure that all fields in your analyzer.json file meet these requirements (a quick validation sketch follows this list); if not, please make the changes manually.
  - Field name starts with a letter or an underscore
  - Field name length is between 1 and 64 characters
  - Field name only uses letters, numbers, and underscores
- Your analyzer ID meets these naming requirements:
  - ID is between 1 and 64 characters long
  - ID only uses letters, numbers, dots, underscores, and hyphens
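
To quickly check a converted analyzer.json against these constraints, here is a small sketch; the "fieldSchema" location of the field definitions is an assumption about the analyzer.json layout:

```python
# Sketch: validate field names and an analyzer ID against the naming rules above.
# The "fieldSchema" -> "fields" location is an assumption about analyzer.json.
import json
import re

FIELD_NAME_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]{0,63}$")  # letter/underscore start; 1-64 chars
ANALYZER_ID_RE = re.compile(r"^[A-Za-z0-9._-]{1,64}$")        # letters, numbers, dots, underscores, hyphens

with open("analyzer.json") as f:
    analyzer = json.load(f)

for name in analyzer.get("fieldSchema", {}).get("fields", {}):
    if not FIELD_NAME_RE.match(name):
        print(f"Invalid field name: {name}")

analyzer_id = "mySampleAnalyzer"  # hypothetical ID to check
if not ANALYZER_ID_RE.match(analyzer_id):
    print(f"Invalid analyzer ID: {analyzer_id}")
```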

A **401** error implies a failure in authentication. Please ensure that your API key and/or subscription ID are correct and that you have access to the endpoint specified.

A **409** error implies that the analyzer ID has already been used to create an analyzer. Please try using another ID.
### Calling Analyze
- A **400** error implies a potentially incorrect endpoint or SAS URL. Ensure that your endpoint is valid _(https://yourendpoint/contentunderstanding/analyzers/yourAnalyzerID:analyze?api-version=2025-05-01-preview)_ and that you are using the correct SAS URL for the document under analysis.
- A **401** error implies a failure in authentication. Please ensure that your API key and/or subscription ID are correct and that you have access to the endpoint specified.
- A **404** error implies that no analyzer exists with the analyzer ID you have specified. Fix this by using the correct ID or by creating an analyzer with that ID.

## Points to Note:
1. Make sure to use Python version 3.9 or above.
2. Signature field types (as found in previous versions of DI) are not yet supported in Content Understanding. Thus, during migration, these signature fields will be ignored when creating the analyzer.
3. The content of training documents is retained in Content Understanding model metadata, specifically under storage. Additional explanation can be found here: https://learn.microsoft.com/en-us/legal/cognitive-services/content-understanding/transparency-note?toc=%2Fazure%2Fai-services%2Fcontent-understanding%2Ftoc.json&bc=%2Fazure%2Fai-services%2Fcontent-understanding%2Fbreadcrumb%2Ftoc.json
4. All data conversion targets the Content Understanding Preview.2 version only.
3 changes: 3 additions & 0 deletions python/di_to_cu_migration_tool/assets/analyzer.png
3 changes: 3 additions & 0 deletions python/di_to_cu_migration_tool/assets/azure-portal.png
3 changes: 3 additions & 0 deletions python/di_to_cu_migration_tool/assets/connected-resources.png
3 changes: 3 additions & 0 deletions python/di_to_cu_migration_tool/assets/endpoint-with-keys.png
3 changes: 3 additions & 0 deletions python/di_to_cu_migration_tool/assets/endpoint.png
3 changes: 3 additions & 0 deletions python/di_to_cu_migration_tool/assets/generate-sas-pop-up.png
3 changes: 3 additions & 0 deletions python/di_to_cu_migration_tool/assets/generate-sas.png
3 changes: 3 additions & 0 deletions python/di_to_cu_migration_tool/assets/individual-file-generate-sas.png
3 changes: 3 additions & 0 deletions python/di_to_cu_migration_tool/assets/manage-connections.png
3 changes: 3 additions & 0 deletions python/di_to_cu_migration_tool/assets/management-center.png
3 changes: 3 additions & 0 deletions python/di_to_cu_migration_tool/assets/sample-azure-resource.png
3 changes: 3 additions & 0 deletions python/di_to_cu_migration_tool/assets/storage-browser.png
3 changes: 3 additions & 0 deletions python/di_to_cu_migration_tool/assets/subscription-id.png
81 changes: 81 additions & 0 deletions python/di_to_cu_migration_tool/call_analyze.py
@@ -0,0 +1,81 @@
# imports from built-in packages
import json
import os
import time
from pathlib import Path

# imports from external packages (in requirements.txt)
import requests
import typer
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobClient
from dotenv import load_dotenv
from rich import print  # For colored output

app = typer.Typer()

@app.command()
def main(
    analyzer_id: str = typer.Option(..., "--analyzer-id", help="Analyzer ID to use for the analyze API"),
    pdf_sas_url: str = typer.Option(..., "--pdf-sas-url", help="SAS URL for the PDF file to analyze"),
    output_json: str = typer.Option("./sample_documents/analyzer_result.json", "--output-json", help="Output JSON file for the analyze result")
):
    """
    Main function to call the analyze API
    """
    assert analyzer_id != "", "Please provide the analyzer ID to use for the analyze API"
    assert pdf_sas_url != "", "Please provide the SAS URL for the PDF file you wish to analyze"

    load_dotenv()

    # Acquire an AAD token for the Cognitive Services scope
    credential = DefaultAzureCredential()
    token = credential.get_token("https://cognitiveservices.azure.com/.default")

    # Extract the access token and build the request headers
    access_token = token.token
    subscription_key = os.getenv("SUBSCRIPTION_KEY")
    headers = {
        "Authorization": f"Bearer {access_token}",
        "Ocp-Apim-Subscription-Key": f"{subscription_key}",
        "Content-Type": "application/pdf"
    }

    host = os.getenv("HOST")
    api_version = os.getenv("API_VERSION")
    endpoint = f"{host}/contentunderstanding/analyzers/{analyzer_id}:analyze?api-version={api_version}"

    # Download the document from blob storage and submit it to the analyze API
    blob = BlobClient.from_blob_url(pdf_sas_url)
    blob_data = blob.download_blob().readall()
    response = requests.post(url=endpoint, data=blob_data, headers=headers)

    response.raise_for_status()
    print(f"[yellow]Analyzing file {pdf_sas_url} with analyzer {analyzer_id}[/yellow]")

    # The analyze API is asynchronous; poll the operation URL until it completes
    operation_location = response.headers.get("Operation-Location", None)
    if not operation_location:
        raise RuntimeError("Error: 'Operation-Location' header is missing.")

    while True:
        poll_response = requests.get(operation_location, headers=headers)
        poll_response.raise_for_status()

        result = poll_response.json()
        status = result.get("status", "").lower()

        if status == "succeeded":
            print(f"[green]Successfully analyzed file {pdf_sas_url} with analyzer ID of {analyzer_id}.[/green]\n")
            analyze_result_file = Path(output_json)
            with open(analyze_result_file, "w") as f:
                json.dump(result, f, indent=4)
            print(f"[green]Analyze result saved to {analyze_result_file}[/green]")
            break
        elif status == "failed":
            print(f"[red]Failed: {result}[/red]")
            break
        else:
            print(".", end="", flush=True)
            time.sleep(0.5)

if __name__ == "__main__":
    app()
66 changes: 66 additions & 0 deletions python/di_to_cu_migration_tool/constants.py
@@ -0,0 +1,66 @@
# Supported DI versions
DI_VERSIONS = ["generative", "neural"]
CU_API_VERSION = "2025-05-01-preview"

# constants
MAX_FIELD_COUNT = 100
MAX_FIELD_LENGTH = 64

# standard file names
FIELDS_JSON = "fields.json"
LABELS_JSON = ".labels.json"
VALIDATION_TXT = "validation.txt"
PDF = ".pdf"
OCR_JSON = ".ocr.json"

# for field type conversion
SUPPORT_FIELD_TYPE = [
"string",
"number",
"integer",
"array",
"object",
"date",
"time",
"boolean",
]

CONVERT_TYPE_MAP = {
"selectionMark": "boolean",
"currency": "number",
}

FIELD_VALUE_MAP = {
"number": "valueNumber",
"integer": "valueInteger",
"date": "valueDate",
"time": "valueTime",
"selectionMark": "valueSelectionMark",
"address": "valueAddress",
"phoneNumber": "valuePhoneNumber",
"currency": "valueCurrency",
"string": "valueString",
"boolean": "valueBoolean",
}

CHECKED_SYMBOL = "☒"
UNCHECKED_SYMBOL = "☐"

# for CU conversion
# spec for valid field types
VALID_CU_FIELD_TYPES = {
"string": "valueString",
"date": "valueDate",
"phoneNumber": "valuePhoneNumber",
"integer": "valueInteger",
"number": "valueNumber",
"array": "valueArray",
"object": "valueObject",
"boolean": "valueBoolean",
"time": "valueTime",
"selectionMark": "valueSelectionMark" # for DI only
}

DATE_FORMATS_SLASHED = ["%d/%m/%y", "%m/%d/%y", "%y/%m/%d", "%d/%m/%Y", "%m/%d/%Y", "%Y/%m/%d"]  # %Y is the 4-digit year format (Ex: 2015) and %y is the 2-digit year format (Ex: 15)
DATE_FORMATS_DASHED = ["%d-%m-%y", "%m-%d-%y", "%y-%m-%d", "%d-%m-%Y", "%m-%d-%Y", "%Y-%m-%d"]  # the same formats with dashes instead of slashes
COMPLETE_DATE_FORMATS = DATE_FORMATS_SLASHED + DATE_FORMATS_DASHED  # combine the two format lists