Releases: meta-pytorch/monarch
0.2.0
Monarch Release Notes
Overview
This release focuses on correctness, robustness, and operational maturity. Major improvements span supervision and shutdown semantics, logging and observability, Kubernetes readiness, SPMD workflows, test hygiene, and build compatibility. Monarch is now more predictable under failure, easier to debug, and better suited for long-running and large-scale deployments.
Supervision & Shutdown
Actor supervision and shutdown behavior has been significantly hardened and clarified.
Key Improvements
- Strict supervision hierarchy
  - Every actor or process has exactly one parent (except the root).
  - Child actors can no longer persist after their parent faults or stops.
- Reliable recursive shutdown (see the sketch after this list)
  - Asking an actor to stop deterministically stops its entire subtree.
  - Shutdown cases are documented, tested, and log spam has been audited.
- Improved fault propagation
  - Supervision errors now describe the full hierarchy of exits.
  - Endpoint failures surface clearer context, including actor and endpoint names.
- HostMesh lifecycle control
  - HostMesh can be cleanly stopped (disconnect clients and kill workers).
  - HostMesh can be force-killed, causing worker loops to exit immediately.
  - Persistent allocations remain usable for reconnects after stop.
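The following is a minimal sketch of these shutdown semantics. The actor API mirrors the 0.1.0 example below; the `procs.stop()` method name is an assumption for illustration, not a confirmed Monarch API.

```python
# Sketch only: illustrates the parent/child and recursive-shutdown semantics
# described above. The actor API mirrors the 0.1.0 example below; the
# procs.stop() method name is an assumption, not a confirmed Monarch API.
from monarch.actor import Actor, endpoint, this_host

class Worker(Actor):
    @endpoint
    def ping(self) -> str:
        return "pong"

procs = this_host().spawn_procs({"gpus": 2})  # parent process mesh
workers = procs.spawn("workers", Worker)      # children of `procs`
workers.ping.call().get()

# Stopping the parent is expected to stop its entire subtree: no child
# actor survives its parent being asked to stop.
procs.stop()  # hypothetical method name
```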
Logging
Logging has been refactored to improve clarity, reduce noise, and clearly separate user-facing signals from system internals.
Key Improvements
- Clear separation of logs
  - Monarch system logs and user logs are cleanly separated.
  - User-visible faults are communicated only via exceptions and supervision events (see the sketch after this list).
- Improved error clarity
  - Errors are categorized (e.g., user, system, infrastructure).
  - Actor names are reported in user-understandable syntax.
  - Actor failure reports include richer context and causal chaining.
- Structured logging
  - Errors emit structured log records suitable for filtering and aggregation.
  - Supervision events follow a defined schema.
- Reduced default noise
  - Log forwarding, aggregation, and enrichment are disabled by default.
  - Log messages have been audited for signal quality.
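The sketch below shows faults reaching user code as exceptions rather than log lines. The Trainer class mirrors the 0.1.0 example below; the exact exception type Monarch raises is not assumed here, so a broad `except` is used.

```python
# Sketch only: user code observes actor failures as Python exceptions rather
# than log lines. Trainer/train mirror the 0.1.0 example below; the exact
# exception type Monarch raises is not assumed, so a broad `except` is used.
import logging

from monarch.actor import Actor, endpoint, this_host

class Trainer(Actor):
    @endpoint
    def train(self, step: int) -> None:
        raise RuntimeError("simulated failure inside the actor")

procs = this_host().spawn_procs({"gpus": 1})
trainers = procs.spawn("trainers", Trainer)

try:
    trainers.train.call(step=0).get()
except Exception as exc:
    # The exception is expected to carry the actor and endpoint names plus
    # the causal chain described above.
    logging.error("trainer endpoint failed: %s", exc)
```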
Observability
Observability has been expanded across actors, meshes, and endpoints.
Key Improvements
- Comprehensive metrics
  - Endpoint latency, throughput, payload size, and error counts are universally available.
  - Metrics are collected on both client and server sides (see the sketch after this list).
- Lifecycle instrumentation
  - Actor, process, and mesh state changes emit structured events.
  - Supervision events are fully instrumented.
- Root-cause visibility
  - The first triggering event in a failure cascade is surfaced.
  - User-parseable actor IDs are linked to internal actor identifiers.
- Tracing
  - Distributed spans cover message send and receive paths.
  - Traces can be visualized via Perfetto and standard tracing backends.
- Performance awareness
  - Instrumentation overhead has been reduced and made configurable.
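As a point of reference, the sketch below measures client-observed endpoint latency by hand, illustrating the kind of client-side latency the built-in metrics report. The `Echo` actor is a hypothetical example; how Monarch exposes its own counters and traces is not shown here.

```python
# Sketch only: a manual client-side latency measurement around an endpoint
# call, illustrating what the built-in client-side latency metrics cover.
# The Echo actor is hypothetical; Monarch's own metric exporters are not shown.
import time

from monarch.actor import Actor, endpoint, this_host

class Echo(Actor):
    @endpoint
    def ping(self, payload: bytes) -> int:
        return len(payload)

procs = this_host().spawn_procs({"gpus": 1})
echoes = procs.spawn("echoes", Echo)

start = time.perf_counter()
echoes.ping.call(payload=b"x" * 1024).get()
elapsed_ms = (time.perf_counter() - start) * 1e3
print(f"client-observed endpoint latency: {elapsed_ms:.2f} ms")
```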
Build Hygiene & Compatibility
Build and dependency management has been simplified.
Key Improvements
- RDMA and tensor engine support are dynamically loaded; the same wheel can be installed whether or not those capabilities are present on the target system.
- Monarch no longer has a binary dependency on PyTorch.
- PyTorch is required only at the Python layer.
- Startup time and binary size are significantly reduced.
Networking
Networking reliability has improved, with a focus on Lightning integration.
Key Improvements
- Lightning integration works on HostMesh v1.
- Networking behavior is documented and standardized for OSS usage.
Deprecation
The legacy v0 codepath has been removed.
0.1.0
Monarch v0.1.0: Initial Release
We're excited to announce the first public release of Monarch, a distributed programming framework for PyTorch built around scalable actor messaging and direct memory access.
Monarch brings together ideas from actor-based concurrency, fault-tolerant supervision, and high-performance tensor communication to make distributed training simpler, more explicit, and faster.
Highlights
- Actor-Based Programming for PyTorch
Define Python classes that run remotely as actors, send them messages, and coordinate distributed work using a clean, imperative API.
```python
from monarch.actor import Actor, endpoint, this_host

# Spawn a mesh of processes on this host, one per GPU.
training_procs = this_host().spawn_procs({"gpus": 8})

class Trainer(Actor):
    @endpoint
    def train(self, step: int): ...

# Spawn one Trainer actor in each process and broadcast a call to all of them.
trainers = training_procs.spawn("trainers", Trainer)
trainers.train.call(step=0).get()
```
- Scalable Messaging and Meshes
  Actors are organized into meshes: collections that support broadcast, gather, and other scalable communication primitives.
- Supervision and Fault Tolerance
  Monarch adopts supervision trees for error handling and recovery. Failures propagate predictably, allowing fine-grained restarts and robust distributed workflows.
- High-Performance RDMA Transfers
  Full RDMA integration for CPU and GPU memory via libibverbs, providing zero-copy, one-sided tensor communication across processes and hosts.
- Distributed Tensors
  Native support for tensors sharded across processes, enabling distributed compute without custom data-movement code.
Monarch is experimental and under active development.
Expect incomplete APIs, rapid iteration, and evolving interfaces.
We welcome contributions; please discuss significant changes or ideas via issues before submitting PRs.
v0.0.0
First Monarch Release!
https://pypi.org/project/torchmonarch/0.0.0/