Skip to content

refactor(registry): make Registry and Singleton thread-safe#1803

Open
Pouyanpi wants to merge 1 commit into
developfrom
refactor/registry-thread-safety
Open

refactor(registry): make Registry and Singleton thread-safe#1803
Pouyanpi wants to merge 1 commit into
developfrom
refactor/registry-thread-safety

Conversation

@Pouyanpi
Copy link
Copy Markdown
Collaborator

@Pouyanpi Pouyanpi commented Apr 20, 2026

Description

Singleton and Registry have TOCTOU (time-of-check-time-of-use) races.

  • Singleton.__call__: two threads racing MyRegistry() can both see the class missing from _instances and both construct an instance thus producing multiple "singleton" objects. Only one wins the final assignment; work already in flight on the loser holds a stale reference.
  • Registry.add: two threads racing reg.add("x", ...) can both see "x" not in items and both assign so silently accepting duplicate registrations instead of raising ValueError.
  • Registry iteration: a reader iterating for name in reg: while another thread calls reg.add(...) raises RuntimeError: dictionary changed size during iteration.

Impact: duplicate httpx connection pools in LLMFrameworkRegistry, silently lost registrations, crashes in concurrent read/write paths. Affects every consumer of this shared infrastructure — LogAdapterRegistry, EmbeddingProviderRegistry, and the incoming LLMFrameworkRegistry.

Fix

  • Singleton: double checked locking on __call__ using a shared threading.Lock.
  • Registry: instance level threading.RLock guards add, get, reset, and the read paths (list, __len__, __iter__, __contains__, __repr__). __iter__ returns a snapshot (iter(list(self.items))) so iteration is stable even if writers mutates concurrently.

Test plan

Three new tests in tests/test_registry.py::TestThreadSafety: Each test uses a slow path in the vulnerable critical section (sleep in __init__ / validate) to widen the race window so the bug is deterministically reproducible.

Before the fix

Reverting the production code and running the new tests (10 consecutive runs):

FAILED TestThreadSafety::test_singleton_under_concurrent_construction
FAILED TestThreadSafety::test_concurrent_add_no_duplicates
FAILED TestThreadSafety::test_iter_during_concurrent_mutation
3 failed ( all 3 fail reliably on every run)

After the fix

tests/test_registry.py ........... [100%]
11 passed in 0.56s

Consistent across 5 consecutive runs.

Summary by CodeRabbit

  • Bug Fixes

    • Improved thread-safety mechanisms in core components.
  • Tests

    • Added comprehensive concurrency test coverage.

Both Singleton and Registry had TOCTOU races on compound
check-then-mutate operations. Under concurrent construction,
Singleton could produce multiple "singleton" instances; under
concurrent add(), Registry could silently accept duplicate
registrations; Registry iteration during mutation could raise
RuntimeError("dictionary changed size during iteration").

Fixes:
- Singleton: double-checked locking on __call__ using a shared
  threading.Lock. Fast path (already constructed) stays lock-free.
- Registry: instance-level threading.RLock guards add/get/reset
  and the read paths (list, __len__, __iter__, __contains__,
  __repr__). __iter__ returns a snapshot so iteration is stable
  even if writers mutate concurrently.

Tests (tests/test_registry.py::TestThreadSafety):
- Singleton: 16 threads racing construction see exactly one
  instance and exactly one __init__ call (slow __init__ widens
  the race window to reliably expose the bug).
- Registry: 16 threads adding the same name — exactly 1 success,
  15 ValueError (slow validate widens the check-then-set window).
- Registry iteration: reader thread iterates while writer adds
  concurrently; no RuntimeError.

All three tests fail reliably against pre-fix code and pass
consistently with the fix.
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 20, 2026

Codecov Report

❌ Patch coverage is 93.33333% with 2 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
nemoguardrails/registry.py 91.66% 2 Missing ⚠️

📢 Thoughts on this report? Let us know!

@Pouyanpi Pouyanpi marked this pull request as ready for review April 21, 2026 09:20
@Pouyanpi Pouyanpi changed the title refactor(core): make Registry and Singleton thread-safe refactor(registry): make Registry and Singleton thread-safe Apr 21, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 21, 2026

Greptile Summary

This PR fixes three TOCTOU races in Singleton and Registry by introducing double-checked locking (shared threading.Lock) on Singleton.__call__ and an instance-level threading.RLock guarding all read/write paths in Registry, with __iter__ returning a dict snapshot for stable concurrent iteration. Three targeted thread-safety tests with sleep-widened race windows are included and demonstrate the fix is reliable.

Confidence Score: 5/5

Safe to merge; all remaining findings are P2 style/hardening suggestions that do not affect current correctness.

The core thread-safety fixes are correct: double-checked locking in Singleton and RLock-guarded paths in Registry both follow established patterns. The only open items are a latent (not present) deadlock risk from the shared non-reentrant Singleton lock and an unguarded _instances.pop in the test setup — neither affects the current production code or test suite under the default sequential pytest runner.

nemoguardrails/singleton.py — the shared threading.Lock() warrants a follow-up to consider RLock for future-proofing.

Important Files Changed

Filename Overview
nemoguardrails/singleton.py Adds double-checked locking to Singleton.call using a class-level threading.Lock(); the shared non-reentrant lock is a latent deadlock risk if any Singleton.init ever constructs another Singleton.
nemoguardrails/registry.py All mutating and read paths are correctly guarded with an instance-level RLock; iter returns a snapshot to prevent RuntimeError during concurrent mutation.
tests/test_registry.py Three new thread-safety tests with deterministic race windows via sleep; direct unguarded mutation of Singleton._instances in test setup is benign today but fragile under parallel test execution.

Sequence Diagram

sequenceDiagram
    participant T1 as Thread 1
    participant T2 as Thread 2
    participant S as Singleton.__call__
    participant R as Registry (RLock)

    Note over T1,T2: Singleton double-checked locking
    T1->>S: cls not in _instances? (no lock)
    T2->>S: cls not in _instances? (no lock)
    T1->>S: acquire _lock (Lock)
    T2-->>S: block on _lock
    S->>S: cls not in _instances? (2nd check)
    S->>S: _instances[cls] = super().__call__()
    T1->>S: release _lock
    T2->>S: acquire _lock
    S->>S: cls already in _instances - skip
    T2->>S: release _lock
    T1-->>T1: return existing instance
    T2-->>T2: return existing instance

    Note over T1,T2: Registry.add concurrent path
    T1->>R: acquire RLock
    T2-->>R: block on RLock
    R->>R: check name not in items
    R->>R: validate(name, item)
    R->>R: items[name] = item
    T1->>R: release RLock
    T2->>R: acquire RLock
    R->>R: name in items - raise ValueError
    T2->>R: release RLock

    Note over T1,T2: Registry.__iter__ snapshot
    T1->>R: acquire RLock - iter(list(items)) - release
    T2->>R: acquire RLock - add new item - release
    T1-->>T1: iterates stable snapshot (no RuntimeError)
Loading
Prompt To Fix All With AI
This is a comment left during a code review.
Path: nemoguardrails/singleton.py
Line: 28-35

Comment:
**Shared non-reentrant lock risks deadlock on nested Singleton construction**

`_lock` is a single `threading.Lock()` shared across *all* Singleton subclasses. If any Singleton's `__init__` instantiates a second, not-yet-initialized Singleton (e.g., `ClassA.__init__` calls `ClassB()` for the first time), Thread A enters `Singleton.__call__` for `ClassA`, acquires `_lock`, then re-enters `Singleton.__call__` for `ClassB` and blocks on `_lock` — deadlock.

`Registry.__init__` does not have this problem today, but the invariant is invisible to future contributors. Using `threading.RLock()` would eliminate the hazard with no behavioural change in the current code.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: tests/test_registry.py
Line: 100

Comment:
**Unguarded direct mutation of `Singleton._instances`**

`Singleton._instances.pop(_SlowSingleton, None)` mutates the shared global dict without acquiring `Singleton._lock`. In the current sequential test run this is harmless, but if tests are ever parallelised with `pytest-xdist` (even across classes in the same process), the unsynchronised mutation races with the outer `if cls not in cls._instances` read in `Singleton.__call__`, potentially resurrecting a supposed-singleton mid-construction. Wrapping the pop in `with Singleton._lock:` would keep the test consistent with the guarantees being validated.

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "refactor(core): make Registry and Single..." | Re-trigger Greptile

Comment on lines +28 to 35
_lock = threading.Lock()

def __call__(cls, *args, **kwargs):
"""Call method for the singleton metaclass."""
if cls not in cls._instances:
cls._instances[cls] = super(Singleton, cls).__call__(*args, **kwargs)
with cls._lock:
if cls not in cls._instances:
cls._instances[cls] = super().__call__(*args, **kwargs)
return cls._instances[cls]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Shared non-reentrant lock risks deadlock on nested Singleton construction

_lock is a single threading.Lock() shared across all Singleton subclasses. If any Singleton's __init__ instantiates a second, not-yet-initialized Singleton (e.g., ClassA.__init__ calls ClassB() for the first time), Thread A enters Singleton.__call__ for ClassA, acquires _lock, then re-enters Singleton.__call__ for ClassB and blocks on _lock — deadlock.

Registry.__init__ does not have this problem today, but the invariant is invisible to future contributors. Using threading.RLock() would eliminate the hazard with no behavioural change in the current code.

Prompt To Fix With AI
This is a comment left during a code review.
Path: nemoguardrails/singleton.py
Line: 28-35

Comment:
**Shared non-reentrant lock risks deadlock on nested Singleton construction**

`_lock` is a single `threading.Lock()` shared across *all* Singleton subclasses. If any Singleton's `__init__` instantiates a second, not-yet-initialized Singleton (e.g., `ClassA.__init__` calls `ClassB()` for the first time), Thread A enters `Singleton.__call__` for `ClassA`, acquires `_lock`, then re-enters `Singleton.__call__` for `ClassB` and blocks on `_lock` — deadlock.

`Registry.__init__` does not have this problem today, but the invariant is invisible to future contributors. Using `threading.RLock()` would eliminate the hazard with no behavioural change in the current code.

How can I resolve this? If you propose a fix, please make it concise.

Comment thread tests/test_registry.py
time.sleep(0.01)
type(self).construction_count += 1

Singleton._instances.pop(_SlowSingleton, None)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Unguarded direct mutation of Singleton._instances

Singleton._instances.pop(_SlowSingleton, None) mutates the shared global dict without acquiring Singleton._lock. In the current sequential test run this is harmless, but if tests are ever parallelised with pytest-xdist (even across classes in the same process), the unsynchronised mutation races with the outer if cls not in cls._instances read in Singleton.__call__, potentially resurrecting a supposed-singleton mid-construction. Wrapping the pop in with Singleton._lock: would keep the test consistent with the guarantees being validated.

Prompt To Fix With AI
This is a comment left during a code review.
Path: tests/test_registry.py
Line: 100

Comment:
**Unguarded direct mutation of `Singleton._instances`**

`Singleton._instances.pop(_SlowSingleton, None)` mutates the shared global dict without acquiring `Singleton._lock`. In the current sequential test run this is harmless, but if tests are ever parallelised with `pytest-xdist` (even across classes in the same process), the unsynchronised mutation races with the outer `if cls not in cls._instances` read in `Singleton.__call__`, potentially resurrecting a supposed-singleton mid-construction. Wrapping the pop in `with Singleton._lock:` would keep the test consistent with the guarantees being validated.

How can I resolve this? If you propose a fix, please make it concise.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 21, 2026

📝 Walkthrough

Walkthrough

The changes add thread-safety to the Registry and Singleton classes by implementing locking mechanisms. The Registry now uses a reentrant lock to protect all state mutations and reads, while Singleton implements double-checked locking for safe instance creation. Comprehensive concurrency tests validate these synchronization additions.

Changes

Cohort / File(s) Summary
Core Synchronization
nemoguardrails/registry.py, nemoguardrails/singleton.py
Added threading locks to ensure thread-safe access to shared state. Registry guards all item access with RLock and snapshots iteration; Singleton uses double-checked locking pattern for instance creation with type-annotated _instances dict.
Test Coverage
tests/test_registry.py
Added TestThreadSafety test suite with three concurrency-focused tests: singleton instantiation under threads, concurrent registry adds for same key, and concurrent iteration during mutation. Includes SlowValidateRegistry helper class and thread synchronization imports.

Sequence Diagram(s)

sequenceDiagram
    participant T1 as Thread 1
    participant T2 as Thread 2
    participant SM as Singleton Metaclass
    participant Lock
    participant Cache

    T1->>SM: __call__() - create instance
    T2->>SM: __call__() - create instance
    
    Note over T1,T2: First check (no lock)
    T1->>Cache: _instances[cls] exists?
    Cache-->>T1: No
    T2->>Cache: _instances[cls] exists?
    Cache-->>T2: No
    
    Note over T1,T2: Both proceed to acquire lock
    T1->>Lock: acquire()
    T1->>Lock: ✓ locked
    T2->>Lock: acquire()
    T2->>Lock: waiting...
    
    Note over T1: Second check (with lock held)
    T1->>Cache: _instances[cls] exists?
    Cache-->>T1: No
    T1->>Cache: create & store instance
    T1->>Lock: release()
    
    T2->>Lock: ✓ acquired
    Note over T2: Second check (with lock held)
    T2->>Cache: _instances[cls] exists?
    Cache-->>T2: Yes (created by T1)
    T2->>Lock: release()
    
    T1-->>T1: return instance
    T2-->>T2: return same instance
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 12.50% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically identifies the main change: making Registry and Singleton classes thread-safe through concurrency fixes.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Test Results For Major Changes ✅ Passed PR makes significant thread-safety changes to Registry and Singleton with comprehensive test coverage including three new concurrent stress tests validating TOCTOU race condition fixes.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch refactor/registry-thread-safety

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
tests/test_registry.py (1)

164-170: Broad exception catch is acceptable here, but consider narrowing for clarity.

The static analysis warning about catching Exception is technically valid, but in this stress test context, catching broadly is reasonable since we want to detect any failure mode. If you prefer to silence the warning while being more explicit about intent:

♻️ Optional: Narrow the exception type
         def reader():
             try:
                 for _ in range(100):
                     for _name in registry:
                         pass
-            except Exception as e:
+            except RuntimeError as e:
+                # Catches "dictionary changed size during iteration"
                 errors.append(e)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_registry.py` around lines 164 - 170, The test's reader() function
currently catches Exception broadly which triggers static-analysis warnings;
update reader to either narrow the caught type to a more specific exception you
expect (e.g., RuntimeError or KeyError) or explicitly document intent by keeping
Exception but adding a suppression/comment (e.g., "# noqa: B110" or a brief
comment) so linters know this broad catch is intentional; reference the reader
function, the registry iteration, and the errors list to implement the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@tests/test_registry.py`:
- Around line 164-170: The test's reader() function currently catches Exception
broadly which triggers static-analysis warnings; update reader to either narrow
the caught type to a more specific exception you expect (e.g., RuntimeError or
KeyError) or explicitly document intent by keeping Exception but adding a
suppression/comment (e.g., "# noqa: B110" or a brief comment) so linters know
this broad catch is intentional; reference the reader function, the registry
iteration, and the errors list to implement the change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 0fa7281e-e122-4d02-9b70-a7cf3c02a1ee

📥 Commits

Reviewing files that changed from the base of the PR and between 032a661 and 22f5fed.

📒 Files selected for processing (3)
  • nemoguardrails/registry.py
  • nemoguardrails/singleton.py
  • tests/test_registry.py

@Pouyanpi Pouyanpi force-pushed the develop branch 4 times, most recently from a6be550 to c69efe5 Compare May 6, 2026 16:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant