Releases: meta-pytorch/monarch
0.2.0
Monarch Release Notes
Overview
This release focuses on correctness, robustness, and operational maturity. Major improvements span supervision and shutdown semantics, logging and observability, Kubernetes readiness, SPMD workflows, test hygiene, and build compatibility. Monarch is now more predictable under failure, easier to debug, and better suited for long-running and large-scale deployments.
Supervision & Shutdown
Actor supervision and shutdown behavior has been significantly hardened and clarified.
Key Improvements
- Strict supervision hierarchy
  - Every actor or process has exactly one parent (except the root).
  - Child actors can no longer persist after their parent faults or stops.
- Reliable recursive shutdown (see the sketch after this list)
  - Asking an actor to stop deterministically stops its entire subtree.
  - Shutdown cases are documented, tested, and log spam has been audited.
- Improved fault propagation
  - Supervision errors now describe the full hierarchy of exits.
  - Endpoint failures surface clearer context, including actor and endpoint names.
- HostMesh lifecycle control
  - HostMesh can be cleanly stopped (disconnect clients and kill workers).
  - HostMesh can be force-killed, causing worker loops to exit immediately.
  - Persistent allocations remain usable for reconnects after stop.
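The following is a minimal sketch of these shutdown semantics. The actor API mirrors the 0.1.0 example below; the `procs.stop()` method name is an assumption for illustration, not a confirmed Monarch API.

```python
# Sketch only: illustrates the parent/child and recursive-shutdown semantics
# described above. The actor API mirrors the 0.1.0 example below; the
# procs.stop() method name is an assumption, not a confirmed Monarch API.
from monarch.actor import Actor, endpoint, this_host

class Worker(Actor):
    @endpoint
    def ping(self) -> str:
        return "pong"

procs = this_host().spawn_procs({"gpus": 2})  # parent process mesh
workers = procs.spawn("workers", Worker)      # children of `procs`
workers.ping.call().get()

# Stopping the parent is expected to stop its entire subtree: no child
# actor survives its parent being asked to stop.
procs.stop()  # hypothetical method name
```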
Logging
Logging has been refactored to improve clarity, reduce noise, and clearly separate user-facing signals from system internals.
Key Improvements
- Clear separation of logs
  - Monarch system logs and user logs are cleanly separated.
  - User-visible faults are communicated only via exceptions and supervision events (see the sketch after this list).
- Improved error clarity
  - Errors are categorized (e.g., user, system, infrastructure).
  - Actor names are reported in user-understandable syntax.
  - Actor failure reports include richer context and causal chaining.
- Structured logging
  - Errors emit structured log records suitable for filtering and aggregation.
  - Supervision events follow a defined schema.
- Reduced default noise
  - Log forwarding, aggregation, and enrichment are disabled by default.
  - Log messages have been audited for signal quality.
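The sketch below shows faults reaching user code as exceptions rather than log lines. The Trainer class mirrors the 0.1.0 example below; the exact exception type Monarch raises is not assumed here, so a broad `except` is used.

```python
# Sketch only: user code observes actor failures as Python exceptions rather
# than log lines. Trainer/train mirror the 0.1.0 example below; the exact
# exception type Monarch raises is not assumed, so a broad `except` is used.
import logging

from monarch.actor import Actor, endpoint, this_host

class Trainer(Actor):
    @endpoint
    def train(self, step: int) -> None:
        raise RuntimeError("simulated failure inside the actor")

procs = this_host().spawn_procs({"gpus": 1})
trainers = procs.spawn("trainers", Trainer)

try:
    trainers.train.call(step=0).get()
except Exception as exc:
    # The exception is expected to carry the actor and endpoint names plus
    # the causal chain described above.
    logging.error("trainer endpoint failed: %s", exc)
```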
Observability
Observability has been expanded across actors, meshes, and endpoints.
Key Improvements
- Comprehensive metrics
  - Endpoint latency, throughput, payload size, and error counts are universally available.
  - Metrics are collected on both client and server sides (see the sketch after this list).
- Lifecycle instrumentation
  - Actor, process, and mesh state changes emit structured events.
  - Supervision events are fully instrumented.
- Root-cause visibility
  - The first triggering event in a failure cascade is surfaced.
  - User-parseable actor IDs are linked to internal actor identifiers.
- Tracing
  - Distributed spans cover message send and receive paths.
  - Traces can be visualized via Perfetto and standard tracing backends.
- Performance awareness
  - Instrumentation overhead has been reduced and made configurable.
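As a point of reference, the sketch below measures client-observed endpoint latency by hand, illustrating the kind of client-side latency the built-in metrics report. The `Echo` actor is a hypothetical example; how Monarch exposes its own counters and traces is not shown here.

```python
# Sketch only: a manual client-side latency measurement around an endpoint
# call, illustrating what the built-in client-side latency metrics cover.
# The Echo actor is hypothetical; Monarch's own metric exporters are not shown.
import time

from monarch.actor import Actor, endpoint, this_host

class Echo(Actor):
    @endpoint
    def ping(self, payload: bytes) -> int:
        return len(payload)

procs = this_host().spawn_procs({"gpus": 1})
echoes = procs.spawn("echoes", Echo)

start = time.perf_counter()
echoes.ping.call(payload=b"x" * 1024).get()
elapsed_ms = (time.perf_counter() - start) * 1e3
print(f"client-observed endpoint latency: {elapsed_ms:.2f} ms")
```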
Build Hygiene & Compatibility
Build and dependency management has been simplified.
Key Improvements
- RDMA and tensor engine support are dynamically loaded; the same wheel can be installed whether or not those capabilities are present on the target system.
- Monarch no longer has a binary dependency on PyTorch.
- PyTorch is required only at the Python layer.
- Startup time and binary size are significantly reduced.
Networking
Networking reliability has improved, with a focus on Lightning integration.
Key Improvements
- Lightning integration works on HostMesh v1.
- Networking behavior is documented and standardized for OSS usage.
Deprecation
The legacy v0 codepath has been removed.
0.1.0
Monarch v0.1.0: Initial Release
We're excited to announce the first public release of Monarch, a distributed programming framework for PyTorch built around scalable actor messaging and direct memory access.
Monarch brings together ideas from actor-based concurrency, fault-tolerant supervision, and high-performance tensor communication to make distributed training simpler, more explicit, and faster.
Highlights
- Actor-Based Programming for PyTorch
Define Python classes that run remotely as actors, send them messages, and coordinate distributed work using a clean, imperative API.
```python
from monarch.actor import Actor, endpoint, this_host

# Spawn a mesh of processes on this host, one per GPU.
training_procs = this_host().spawn_procs({"gpus": 8})

class Trainer(Actor):
    @endpoint
    def train(self, step: int): ...

# Spawn one Trainer actor in each process and broadcast a call to all of them.
trainers = training_procs.spawn("trainers", Trainer)
trainers.train.call(step=0).get()
```
- Scalable Messaging and Meshes
  Actors are organized into meshes: collections that support broadcast, gather, and other scalable communication primitives.
- Supervision and Fault Tolerance
  Monarch adopts supervision trees for error handling and recovery. Failures propagate predictably, allowing fine-grained restarts and robust distributed workflows.
- High-Performance RDMA Transfers
  Full RDMA integration for CPU and GPU memory via libibverbs, providing zero-copy, one-sided tensor communication across processes and hosts.
- Distributed Tensors
  Native support for tensors sharded across processes, enabling distributed compute without custom data-movement code.
Monarch is experimental and under active development.
Expect incomplete APIs, rapid iteration, and evolving interfaces.
We welcome contributions; please discuss significant changes or ideas via issues before submitting PRs.
v0.0.0
First Monarch Release!
https://pypi.org/project/torchmonarch/0.0.0/