This repository provides a solution for generating and verifying query–function calling pairs used to fine-tune and evaluate function calling models. It comprises three main components:
- Data Generation: Creates synthetic query–answer pairs by leveraging seed QA examples, API function definitions, and dynamic prompt templates.
- Multi-Stage Verification: Uses format and semantic checkers to ensure that generated pairs are structurally correct and semantically aligned.
- Dataset Split & Conversion: Splits data into training, validation (and optionally test) sets and converts them to a PHI fine-tuning format.
All core processing logic is contained in the src directory, which includes modules for generation, verification, format conversion, logging, inference, and custom exception handling. The designs of the Data Generation and Multi-Stage Verification components are inspired by APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets. This solution automates the creation of diverse query–function calling pairs, significantly reducing the need for manual annotation.
The system implements a data synthesis pipeline (pipelines/main.yaml) with three core modules:
- Data Generation: Uses seed QA examples and API function definitions to generate new synthetic query–function calling pairs. The module src/generate_synthetic_pairs.py handles prompt creation and asynchronous model calls.
- Multi-Stage Verification: Two checkers validate each generated pair (a minimal sketch of the format checks follows the sample output below). The Format Checker (src/format_checker.py) validates required arguments, data types, enum values, numeric boundaries, and conditional requirements. The Semantic Checker (src/semantic_checker.py) uses a separate prompt and asynchronous inference to ensure that the generated calls semantically align with the query.
- Dataset Split & Conversion: Data Split (src/split_data.py) splits verified data into training and validation sets using stratified sampling. Chat Message Format Conversion (src/apply_chat_message_format.py) converts the split data into a format suitable for fine-tuning by wrapping each data point into a message sequence that includes:
  - A system prompt containing instructions.
  - The user query.
  - The assistant's (converted) function call formatted as a JSON string.
Below is a sample output after conversion (stored as a single line in the JSONL file, pretty-printed here for readability):
{
  "messages": [
    { "role": "system", "content": "You are an in-car assistant with a list of tools. ..." },
    { "role": "user", "content": "Play the song 'Starlight' on MusicBox" },
    { "role": "assistant", "content": "[{\"function_name\": \"play_audio_track\", \"arguments\": {\"service\": \"MusicBox\", \"media_type\": \"track\", \"title\": \"Starlight\"}}]" }
  ]
}
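To make the Format Checker's job concrete, here is a minimal, hypothetical sketch of the kinds of structural rules it enforces against an OpenAI-style function definition. The function and its logic are illustrative only (conditional requirements are omitted) and do not mirror the actual implementation in src/format_checker.py:

def check_call_against_spec(call: dict, spec: dict) -> list[str]:
    """Return a list of format violations for one generated function call (illustrative)."""
    errors = []
    properties = spec["parameters"]["properties"]
    required = spec["parameters"].get("required", [])
    args = call.get("arguments", {})
    # Required arguments must all be present.
    for name in required:
        if name not in args:
            errors.append(f"missing required argument: {name}")
    for name, value in args.items():
        rule = properties.get(name)
        if rule is None:
            errors.append(f"unknown argument: {name}")
            continue
        # Data types: map JSON Schema type names onto Python types.
        expected = {"string": str, "integer": int, "number": (int, float), "boolean": bool}.get(rule.get("type"))
        if expected is not None and not isinstance(value, expected):
            errors.append(f"wrong type for argument: {name}")
        # Enum values must come from the allowed set.
        if "enum" in rule and value not in rule["enum"]:
            errors.append(f"invalid enum value for argument: {name}")
        # Numeric boundaries.
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            if "minimum" in rule and value < rule["minimum"]:
                errors.append(f"argument {name} is below the minimum")
            if "maximum" in rule and value > rule["maximum"]:
                errors.append(f"argument {name} is above the maximum")
    return errors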
Warning
Many LLMs are distributed under licenses that restrict using their outputs to synthesize data for fine-tuning another SLM/LLM. For example, as of Mar 6, 2025, the Azure OpenAI Service Specific Terms do not allow that usage except for certain permitted cases listed in the terms; check the terms first if you intend to use Azure OpenAI for data synthesis for fine-tuning. This solution was tested primarily with Phi3.5 MoE instruct, as it is one of the most capable models that still carries a permissive MIT license. Check the models available on Azure AI Foundry and their licenses on the Azure AI model catalog, and decide which model's license conditions are acceptable for your team.
- Visual Studio Code installed
- Dev Containers extension installed
- Docker (Installation Guide)
- Clone the repository and open it:
  git clone git@github.com:cse-labs/function-calling-data-synthesizer.git
  cd function-calling-data-synthesizer
  code .
- Copy .env.example and rename it to .env. This file will be automatically loaded into the dev container as environment variables. Important: if you skip this step, you will get an error when trying to build the container.
- Deploy an Azure Machine Learning Workspace and Azure AI Foundry.
- Fill out WORKSPACE_NAME and RESOURCE_GROUP in .env with your Azure Machine Learning Workspace settings.
- Use the Dev Containers: Reopen in Container command from the Command Palette (F1, ⇧⌘P) to reopen the repository inside a Dev Container.
- Deploy Phi3.5 MoE instruct (the model we mainly tested this solution with) or any other available model from the Azure AI model catalog on Azure AI Foundry.
- Get the API endpoints and API keys of your deployed model(s) and fill MODEL_API_BASE_URL[1-] and MODEL_API_KEY[1-] with those values. Currently inference.py and .env.example support only two endpoints, but you can add more endpoints if you hit token limits. As of Mar 6, 2025, most models on the Azure AI Foundry model catalog don't allow increasing the token limit without going through a support request, so deploying more endpoints is one quick way to address token limit errors.
- Update the prompts in prompts/generator_config.yaml and prompts/system_prompt.txt for your use case. The current example templates target an in-car (Car AI) assistant use case.
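For illustration, a filled-in .env might look like the following sketch. The exact variable names (including how the endpoint indices are written) should be taken from .env.example; all values below are placeholders:
WORKSPACE_NAME=my-aml-workspace
RESOURCE_GROUP=my-resource-group
MODEL_API_BASE_URL1=https://my-phi-deployment-1.eastus2.models.ai.azure.com
MODEL_API_KEY1=<your-api-key-1>
MODEL_API_BASE_URL2=https://my-phi-deployment-2.eastus2.models.ai.azure.com
MODEL_API_KEY2=<your-api-key-2>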
You need to prepare two kinds of input data (a hypothetical example of each format follows this list):
- Function definitions JSON file: OpenAI's Function Spec is supported. Follow OpenAI's API reference and function calling guide to define your own function definitions. An example is located at data/functions_definition.json.
- Seed QA (Query and function calling Answer) dataset JSONL file: This dataset serves as seed few-shot examples for generating new synthetic query–answer pairs. It must be in JSONL format, where every line is a JSON object containing the keys "query" and "function_calls". You can create your own seed QA dataset or use the provided example located at data/examples/seed_qa_dataset.jsonl.
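For illustration, here is a hypothetical pair of inputs consistent with the play_audio_track sample shown earlier; the descriptions and the extra enum values are invented for this sketch, and the real schemas live in the example files above.
A function definition entry (OpenAI Function Spec):
{
  "name": "play_audio_track",
  "description": "Play media on a music streaming service.",
  "parameters": {
    "type": "object",
    "properties": {
      "service": { "type": "string", "description": "Name of the streaming service" },
      "media_type": { "type": "string", "enum": ["track", "album", "playlist"] },
      "title": { "type": "string", "description": "Title of the media to play" }
    },
    "required": ["service", "media_type", "title"]
  }
}
A matching seed QA example (a single line in the JSONL file):
{"query": "Play the song 'Starlight' on MusicBox", "function_calls": [{"function_name": "play_audio_track", "arguments": {"service": "MusicBox", "media_type": "track", "title": "Starlight"}}]}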
This section gives you an overview of the directory structure of this solution. Only essential files are shown for simplicity:
├── .devcontainer/ # Dockerfile and dev container configuration
├── .github/ # CI pipeline for Github Actions
├── data/ # Data directory
│ └── examples/ # Example function definition and seed QA data
├── pipelines/ # Pipeline files for AML CLI
├── prompts/ # Prompt files for data synthesis
├── src/ # Source code for data synthesis
├── tests/ # Unit and functional tests
├── .env.example # Environment variable examples
├── .amlignore # Files ignored when uploading to AML pipeline
├── .pre-commit-config.yaml # Pre-commit configuration
├── pyproject.toml # Project python dependencies + python tool configurations
└── README.md # This file
The src directory contains:
- Generation and Inference Modules (generate_synthetic_pairs.py, inference.py): data synthesis via model inference.
- Verification Modules (format_checker.py and semantic_checker.py): multi-stage validation.
- Data Conversion & Splitting Modules (apply_chat_message_format.py, split_data.py): preparing data for training.
- Utility Modules (e.g., log_handlers.py, custom_exceptions.py): used throughout the pipeline.
Run each step below in order. First, generate synthetic query–answer pairs:
python src/generate_synthetic_pairs.py \
--config-path prompts/generator_config.yaml \
--qa-jsonl-path data/YOUR_SEED_QA_DATASET.jsonl \
--function-definitions-path data/YOUR_LOCAL_FUNCTION_DEFINITION.json \
--output-path data/YOUR_OUTPUT_PATH.jsonl
Next, verify the generated pairs:
python src/verify_generated_query_answer_pairs.py \
--generated-query-answer-path data/YOUR_GENERATED_QA_INPUT_PATH.jsonl \
--function-definitions-path data/YOUR_LOCAL_FUNCTION_DEFINITION.json \
--verified-query-answer-path data/YOUR_OUTPUT_PATH.jsonl
Split the verified data into training and validation sets:
python src/split_data.py \
--input-file-path data/YOUR_VERIFIED_DATA.jsonl \
--train-output-path data/train.jsonl \
--val-output-path data/validation.jsonl \
--test-size 0.3
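split_data.py performs the stratified split for you; conceptually it resembles the following scikit-learn sketch, where stratifying by the name of the called function is an assumption made for illustration (the actual stratification key may differ):

import json
from sklearn.model_selection import train_test_split

# Load the verified records (the path is a placeholder).
with open("data/YOUR_VERIFIED_DATA.jsonl") as f:
    records = [json.loads(line) for line in f]

# Assumed stratification label: the single function called in each record.
labels = [r["function_calls"][0]["function_name"] for r in records]

# 70/30 train/validation split that preserves label proportions.
train, val = train_test_split(records, test_size=0.3, stratify=labels, random_state=42)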
Finally, convert the split data into the chat message format for fine-tuning:
python src/apply_chat_message_format.py \
--input-path data/train.jsonl \
--output-path data/fine_tuning_format_conversion.jsonl
This repository includes a complete AML pipeline (see pipelines/main.yaml). The pipeline runs all steps in sequence, from data generation through verification and splitting to fine-tuning format conversion.
- Create an AML custom environment using the Dockerfile:
az login
az ml environment create --file pipelines/create_aml_custom_env.yaml --workspace-name $WORKSPACE_NAME --resource-group $RESOURCE_GROUP
- Register local datasets as Azure Machine Learning Data Assets with the following commands:
az login
az ml data create --file pipelines/register_functions_definition.yaml -w $WORKSPACE_NAME -g $RESOURCE_GROUP
az ml data create --file pipelines/register_seed_qa_dataset.yaml -w $WORKSPACE_NAME -g $RESOURCE_GROUP
To submit a job:
az ml job create --file pipelines/main.yaml --resource-group $RESOURCE_GROUP --workspace-name $WORKSPACE_NAME
For further details, refer to the Azure ML documentation.
We keep the necessary libraries for this repository in pyproject.toml and manage them with Poetry. To add a new package, run:
poetry add package_name
To run the unit tests, execute:
PYTHONPATH=src pytest tests/
Ensure all tests pass before submitting changes.
This solution currently assumes that each query has a single intent satisfied by a single function call. Supporting multiple intents or multiple function calls may require updates to the inference and verification logic.
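For example, a multi-intent query like the following, whose function_calls list holds two entries, would not be produced or verified correctly by the current logic (the set_cabin_temperature function is hypothetical):
{"query": "Play 'Starlight' on MusicBox and set the cabin temperature to 22", "function_calls": [{"function_name": "play_audio_track", "arguments": {"service": "MusicBox", "media_type": "track", "title": "Starlight"}}, {"function_name": "set_cabin_temperature", "arguments": {"temperature_celsius": 22}}]}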