Update Skypilot orchestrator settings and features #3612

Merged
merged 47 commits into from
May 28, 2025
Changes from 40 commits
Commits
ec907fc
Update Skypilot orchestrator settings and features
htahir1 Apr 28, 2025
333127e
Add support for arbitrary task, resource, launch settings
htahir1 Apr 28, 2025
524d7f7
Merge branch 'develop' into feature/updateskypilot
htahir1 Apr 28, 2025
ba5560a
Add Field annotations to Skypilot Orchestrator Base VM config
htahir1 May 4, 2025
97f3ad8
Refactor Skypilot orchestrator utility functions
htahir1 May 4, 2025
ed78460
Fix typo in sanitize_cluster_name test expected output
htahir1 May 4, 2025
4aa8599
Update Skypilot integrations to version 0.9.2
htahir1 May 4, 2025
89ac93e
Remove unnecessary lines in Skypilot integrations
htahir1 May 4, 2025
5062c4f
Add use_sudo parameter to prepare_docker_setup().
htahir1 May 4, 2025
97a6e22
Merge remote-tracking branch 'origin/develop' into feature/updateskyp…
htahir1 May 4, 2025
c057a59
Update Skypilot integrations with omegaconf>=2.4.0.dev3
htahir1 May 4, 2025
90155a1
Update omegaconf requirement to include version 2.3
htahir1 May 4, 2025
3787942
Add Skypilot orchestrator utility functions
htahir1 May 4, 2025
3cd356c
Update Skypilot integrations requirements
htahir1 May 5, 2025
9074d93
Update Skypilot VM Orchestrator installation command
htahir1 May 5, 2025
7a7ebaf
Update Skypilot-VM integration instructions
htahir1 May 5, 2025
4902413
Add installation instructions for Skypilot AWS dependencies
htahir1 May 5, 2025
913c34c
Merge branch 'develop' into feature/updateskypilot
bcdurak May 12, 2025
c6c9b23
fix tests
bcdurak May 12, 2025
67489f8
fix docs
bcdurak May 12, 2025
b1c93e8
fix code
bcdurak May 12, 2025
645ba8c
Merge branch 'develop' into feature/updateskypilot
bcdurak May 12, 2025
c190489
fixed reason
bcdurak May 12, 2025
491b666
skypilot adjustments
bcdurak May 13, 2025
df23feb
Merge branch 'develop' into feature/updateskypilot
bcdurak May 13, 2025
a59dea1
formatting and lintingq
bcdurak May 13, 2025
04b004b
new docs
bcdurak May 13, 2025
36cd269
spellchecker
bcdurak May 13, 2025
aff0c84
Merge branch 'develop' into feature/updateskypilot
bcdurak May 13, 2025
73351cb
update the requirements
bcdurak May 22, 2025
6233e76
merged develop
bcdurak May 22, 2025
6b364b4
updated the docs
bcdurak May 22, 2025
773d989
even more docs updates
bcdurak May 22, 2025
1a9ee5c
more docs fixes
bcdurak May 22, 2025
cb65c9f
removed skypilot tests
bcdurak May 22, 2025
1470dd8
adjusting skypilot gcp reqs
bcdurak May 22, 2025
2e8a28b
shorter installation time again
bcdurak May 22, 2025
909498a
new async changes
bcdurak May 23, 2025
c43bd6a
handling the logs
bcdurak May 23, 2025
f74d98f
checkpoint
bcdurak May 26, 2025
126b23d
another checkpoint
bcdurak May 26, 2025
1ef18d7
Merge branch 'develop' into feature/updateskypilot
bcdurak May 27, 2025
9e7146f
merged develop again
bcdurak May 27, 2025
053b127
final changes
bcdurak May 27, 2025
001468e
docstrings and spellchecker
bcdurak May 27, 2025
3347ecc
one final touch
bcdurak May 28, 2025
757bc26
fix docs
bcdurak May 28, 2025
70 changes: 45 additions & 25 deletions docs/book/component-guide/orchestrators/skypilot-vm.md
Expand Up @@ -40,28 +40,13 @@ All ZenML pipeline runs are executed using Docker containers within the VMs prov
You don't need to do anything special to deploy the SkyPilot VM Orchestrator. As the SkyPilot integration itself takes care of provisioning VMs, you can simply use the orchestrator as you would any other ZenML orchestrator. However, you will need to ensure that you have the appropriate permissions to provision VMs on your cloud provider of choice and to configure your SkyPilot orchestrator accordingly using the [service connectors](https://docs.zenml.io/how-to/infrastructure-deployment/auth-management/service-connectors-guide) feature.

{% hint style="info" %}
The SkyPilot VM Orchestrator currently only supports the AWS, GCP, and Azure cloud platforms.
The SkyPilot VM Orchestrator currently only supports the AWS, GCP, Azure, Lambda Labs and Kubernetes platforms.
{% endhint %}

## How to use it

To use the SkyPilot VM Orchestrator, you need:

* One of the SkyPilot integrations installed. You can install the SkyPilot integration for your cloud provider of choice using the following command:

```shell
# For AWS
pip install "zenml[connectors-aws]"
zenml integration install aws skypilot_aws

# for GCP
pip install "zenml[connectors-gcp]"
zenml integration install gcp skypilot_gcp # for GCP

# for Azure
pip install "zenml[connectors-azure]"
zenml integration install azure skypilot_azure # for Azure
```
* [Docker](https://www.docker.com) installed and running.
* A [remote artifact store](https://docs.zenml.io/stacks/artifact-stores/) as part of your stack.
* A [remote container registry](https://docs.zenml.io/stacks/container-registries/) as part of your stack.
Expand All @@ -71,11 +56,12 @@ To use the SkyPilot VM Orchestrator, you need:

{% tabs %}
{% tab title="AWS" %}
We need first to install the SkyPilot integration for AWS and the AWS connectors extra, using the following two commands:
We need first to install the SkyPilot integration for AWS and the AWS connectors extra, using the following commands:

```shell
# Installs dependencies for Skypilot AWS, AWS Container Registry, and S3 Artifact Store
pip install "zenml[connectors-aws]"
zenml integration install aws skypilot_aws
zenml integration install aws skypilot_aws # We recommend using the --uv option here
```

To provision VMs on AWS, your VM Orchestrator stack component needs to be configured to authenticate with [AWS Service Connector](https://docs.zenml.io/how-to/infrastructure-deployment/auth-management/aws-service-connector). To configure the AWS Service Connector, you need to register a new service connector configured with AWS credentials that have at least the minimum permissions required by SkyPilot as documented [here](https://skypilot.readthedocs.io/en/latest/cloud-setup/cloud-permissions/aws.html).
Expand Down Expand Up @@ -175,13 +161,33 @@ zenml stack register <STACK_NAME> -o <ORCHESTRATOR_NAME> ... --set
{% endtab %}

{% tab title="Azure" %}
We need first to install the SkyPilot integration for Azure and the Azure extra for ZenML, using the following two commands
We need first to install the SkyPilot integration for Azure and the extra requirements that are needed from additional Azure components, using the following two commands

{% hint style="warning" %}
Currently, the ZenML Skypilot integration is **pip-incompatible** with the ZenML Azure integration, therefore executing `zenml integration install azure skypilot_azure` will not work.

Since working with a skypilot stack requires you to use a remote artifact store and container registry, please install the requirements of these components with pip to avoid any installation problems.
{% endhint %}

```shell
pip install "zenml[connectors-azure]"
zenml integration install azure skypilot_azure
pip install "zenml[connectors-azure]" adlfs azure-mgmt-containerservice azure-storage-blob
```

{% hint style="warning" %}
If you would like to use `uv` to install the stack requirements for an Azure Skypilot Stack, you need to use `python_package_installer_args={"prerelease": "allow"}`:

```python
docker_settings = DockerSettings(
python_package_installer=PythonPackageInstaller.UV,
python_package_installer_args={"prerelease": "allow"},
)

@pipeline(settings={"docker": docker_settings})
def basic_pipeline():
...
```
{% endhint %}

To provision VMs on Azure, your VM Orchestrator stack component needs to be configured to authenticate with [Azure Service Connector](https://docs.zenml.io/how-to/infrastructure-deployment/auth-management/azure-service-connector)

To configure the Azure Service Connector, you need to register a new service connector, but first let's check the available service connectors types using the following command:
Expand Down Expand Up @@ -314,6 +320,18 @@ For additional configuration of the Skypilot orchestrator, you can pass `Setting
* `down`: Tear down the cluster after all jobs finish (successfully or abnormally). If `idle_minutes_to_autostop` is also set, the cluster will be torn down after the specified idle time. Note that if errors occur during provisioning/data syncing/setting up, the cluster will not be torn down for debugging purposes.
* `stream_logs`: If True, show the logs in the terminal as they are generated while the cluster is running.
* `docker_run_args`: Additional arguments to pass to the `docker run` command. For example, `['--gpus=all']` to use all GPUs available on the VM.
* `ports`: Ports to expose. Could be an integer, a range, or a list of integers and ranges. All ports will be exposed to the public internet.
* `labels`: Labels to apply to instances as key-value pairs. These are mapped to cloud-specific implementations (instance tags in AWS, instance labels in GCP, etc.).
* `any_of`: List of candidate resources to try in order of preference based on cost (determined by the SkyPilot optimizer).
* `ordered`: List of candidate resources to try in the specified order.
* `workdir`: Working directory on the local machine to sync to the VM. This is synced to `~/sky_workdir` inside the VM.
* `task_name`: Human-readable task name shown in SkyPilot for display purposes.
* `num_nodes`: Number of nodes to launch (including the head node).
* `file_mounts`: File and storage mounts configuration to make local or cloud storage paths available inside the remote cluster.
* `envs`: Environment variables for the task. Accessible in the SkyPilot setup/run phases and inside your pipeline steps.
* `task_settings`: Dictionary of arbitrary settings forwarded to `sky.Task()`. This allows passing future parameters added by SkyPilot without requiring updates to ZenML.
* `resources_settings`: Dictionary of arbitrary settings forwarded to `sky.Resources()`. This allows passing future parameters added by SkyPilot without requiring updates to ZenML.
* `launch_settings`: Dictionary of arbitrary settings forwarded to `sky.launch()`. This allows passing future parameters added by SkyPilot without requiring updates to ZenML.
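
As a rough illustration of how a pass-through dictionary such as `task_settings` could be combined with the typed settings before being forwarded, here is a minimal sketch. The helper `build_kwargs` is hypothetical and not part of ZenML or SkyPilot, and the precedence rule (pass-through entries override explicit ones) is an assumption that this diff does not confirm.

```python
from typing import Any, Dict, Optional


def build_kwargs(
    explicit: Dict[str, Optional[Any]],
    passthrough: Dict[str, Any],
) -> Dict[str, Any]:
    """Hypothetical helper: combine typed settings with a pass-through dict."""
    # Drop settings that were left unset so the receiving library's defaults apply.
    kwargs = {key: value for key, value in explicit.items() if value is not None}
    # Assumption: pass-through entries win over explicit ones on conflict.
    kwargs.update(passthrough)
    return kwargs


# Example: `num_nodes` is unset in the typed settings but supplied via pass-through.
task_kwargs = build_kwargs(
    explicit={"name": "train", "num_nodes": None, "workdir": "./src"},
    passthrough={"num_nodes": 2},
)
print(task_kwargs)
```

Filtering out `None` values first keeps unset fields from overriding the defaults of whatever constructor eventually receives the keyword arguments.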

The following code snippets show how to configure the orchestrator settings for each cloud provider:

Expand All @@ -340,7 +358,7 @@ skypilot_settings = SkypilotAWSOrchestratorSettings(
retry_until_up=True,
idle_minutes_to_autostop=60,
down=True,
stream_logs=True
stream_logs=True,
docker_run_args=["--gpus=all"]
)

Expand Down Expand Up @@ -376,7 +394,8 @@ skypilot_settings = SkypilotGCPOrchestratorSettings(
retry_until_up=True,
idle_minutes_to_autostop=60,
down=True,
stream_logs=True
stream_logs=True,
docker_run_args=["--gpus=all"]
)


Expand Down Expand Up @@ -410,7 +429,8 @@ skypilot_settings = SkypilotAzureOrchestratorSettings(
retry_until_up=True,
idle_minutes_to_autostop=60,
down=True,
stream_logs=True
stream_logs=True,
docker_run_args=["--gpus=all"]
)


Expand Down Expand Up @@ -462,7 +482,7 @@ skypilot_settings = SkypilotKubernetesOrchestratorSettings(
disk_size=100,
cluster_name="my_cluster",
retry_until_up=True,
stream_logs=True
stream_logs=True,
docker_run_args=["--gpus=all"]
)

Expand Down
2 changes: 1 addition & 1 deletion scripts/install-zenml-dev.sh
Expand Up @@ -36,7 +36,7 @@ install_integrations() {
# figure out the python version
python_version=$(python -c "import sys; print('.'.join(map(str, sys.version_info[:2])))")

ignore_integrations="feast label_studio bentoml seldon pycaret skypilot_aws skypilot_gcp skypilot_azure pigeon prodigy argilla"
ignore_integrations="feast label_studio bentoml seldon pycaret skypilot_aws skypilot_gcp skypilot_azure skypilot_kubernetes skypilot_lambda pigeon prodigy argilla"

# Ignore tensorflow and deepchecks only on Python 3.12
if [ "$python_version" = "3.12" ]; then
Expand Down
Expand Up @@ -13,7 +13,7 @@
# permissions and limitations under the License.
"""Skypilot orchestrator base config and settings."""

from typing import Dict, List, Literal, Optional, Union
from typing import Any, Dict, List, Literal, Optional, Union

from pydantic import Field

Expand Down Expand Up @@ -67,6 +67,14 @@ class SkypilotBaseOrchestratorSettings(BaseSettings):
disk_size: the size of the OS disk in GiB.
disk_tier: the disk performance tier to use. If None, defaults to
``'medium'``.
ports: Ports to expose. Could be an integer, a range, or a list of
integers and ranges. All ports will be exposed to the public internet.
labels: Labels to apply to instances as key-value pairs. These are
mapped to cloud-specific implementations (instance tags in AWS,
instance labels in GCP, etc.)
any_of: List of candidate resources to try in order of preference based on
cost (determined by the optimizer).
ordered: List of candidate resources to try in the specified order.

cluster_name: name of the cluster to create/reuse. If None,
auto-generate a name.
Expand All @@ -88,6 +96,20 @@ class SkypilotBaseOrchestratorSettings(BaseSettings):
stream_logs: if True, show the logs in the terminal.
docker_run_args: Optional arguments to pass to the `docker run` command
running inside the VM.
workdir: Working directory to sync to the VM. Synced to ~/sky_workdir.
task_name: Task name used for display purposes.
num_nodes: Number of nodes to launch (including the head node).
file_mounts: File and storage mounts configuration for remote cluster.
envs: Environment variables for the task. Accessible in setup/run.
task_settings: Dictionary of arbitrary settings to pass to sky.Task().
This allows passing future parameters added by SkyPilot without
requiring updates to ZenML.
resources_settings: Dictionary of arbitrary settings to pass to
sky.Resources(). This allows passing future parameters added
by SkyPilot without requiring updates to ZenML.
launch_settings: Dictionary of arbitrary settings to pass to
sky.launch(). This allows passing future parameters added
by SkyPilot without requiring updates to ZenML.
"""

# Resources
Expand All @@ -103,24 +125,45 @@ class SkypilotBaseOrchestratorSettings(BaseSettings):
)
accelerator_args: Optional[Dict[str, str]] = None
use_spot: Optional[bool] = None
job_recovery: Optional[str] = None
job_recovery: Union[None, str, Dict[str, Any]] = Field(
default=None, union_mode="left_to_right"
)
region: Optional[str] = None
zone: Optional[str] = None
image_id: Union[Dict[str, str], str, None] = Field(
default=None, union_mode="left_to_right"
)
disk_size: Optional[int] = None
disk_tier: Optional[Literal["high", "medium", "low"]] = None
disk_tier: Optional[Literal["high", "medium", "low", "ultra", "best"]] = (
None
)

# Run settings
cluster_name: Optional[str] = None
retry_until_up: bool = False
idle_minutes_to_autostop: Optional[int] = 30
down: bool = True
stream_logs: bool = True

docker_run_args: List[str] = []

# Additional SkyPilot features
ports: Union[None, int, str, List[Union[int, str]]] = Field(
default=None, union_mode="left_to_right"
)
labels: Optional[Dict[str, str]] = None
any_of: Optional[List[Dict[str, Any]]] = None
ordered: Optional[List[Dict[str, Any]]] = None
workdir: Optional[str] = None
task_name: Optional[str] = None
num_nodes: Optional[int] = None
file_mounts: Optional[Dict[str, Any]] = None
envs: Optional[Dict[str, str]] = None

# Future-proofing settings dictionaries
task_settings: Dict[str, Any] = {}
resources_settings: Dict[str, Any] = {}
launch_settings: Dict[str, Any] = {}


class SkypilotBaseOrchestratorConfig(
BaseOrchestratorConfig, SkypilotBaseOrchestratorSettings
Expand Down
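
The `ports` field above accepts `None`, a single integer, a string such as a range, or a list mixing both. A consumer of that field may want one uniform shape. The sketch below shows one way to do that in plain Python; `normalize_ports` and the `PortSpec` alias are hypothetical names introduced for illustration, not functions from ZenML or SkyPilot.

```python
from typing import List, Union

# Mirrors the accepted shapes of the `ports` setting in the diff above.
PortSpec = Union[None, int, str, List[Union[int, str]]]


def normalize_ports(ports: PortSpec) -> List[str]:
    """Hypothetical helper: turn any accepted `ports` value into a list of strings."""
    if ports is None:
        # Nothing to expose.
        return []
    if isinstance(ports, (int, str)):
        # A single port (8080) or a single range ("9000-9010").
        return [str(ports)]
    # A list that may mix integers and range strings.
    return [str(port) for port in ports]


print(normalize_ports(8080))
print(normalize_ports([80, "9000-9010"]))
```

Ranges are kept as strings and not expanded, since how a range is interpreted is left to the downstream library.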