refactor(registry): make Registry and Singleton thread-safe#1803
refactor(registry): make Registry and Singleton thread-safe#1803Pouyanpi wants to merge 1 commit into
Conversation
Both Singleton and Registry had TOCTOU races on compound
check-then-mutate operations. Under concurrent construction,
Singleton could produce multiple "singleton" instances; under
concurrent add(), Registry could silently accept duplicate
registrations; Registry iteration during mutation could raise
RuntimeError("dictionary changed size during iteration").
Fixes:
- Singleton: double-checked locking on __call__ using a shared
threading.Lock. Fast path (already constructed) stays lock-free.
- Registry: instance-level threading.RLock guards add/get/reset
and the read paths (list, __len__, __iter__, __contains__,
__repr__). __iter__ returns a snapshot so iteration is stable
even if writers mutate concurrently.
Tests (tests/test_registry.py::TestThreadSafety):
- Singleton: 16 threads racing construction see exactly one
instance and exactly one __init__ call (slow __init__ widens
the race window to reliably expose the bug).
- Registry: 16 threads adding the same name — exactly 1 success,
15 ValueError (slow validate widens the check-then-set window).
- Registry iteration: reader thread iterates while writer adds
concurrently; no RuntimeError.
All three tests fail reliably against pre-fix code and pass
consistently with the fix.
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
Greptile SummaryThis PR fixes three TOCTOU races in
|
| Filename | Overview |
|---|---|
| nemoguardrails/singleton.py | Adds double-checked locking to Singleton.call using a class-level threading.Lock(); the shared non-reentrant lock is a latent deadlock risk if any Singleton.init ever constructs another Singleton. |
| nemoguardrails/registry.py | All mutating and read paths are correctly guarded with an instance-level RLock; iter returns a snapshot to prevent RuntimeError during concurrent mutation. |
| tests/test_registry.py | Three new thread-safety tests with deterministic race windows via sleep; direct unguarded mutation of Singleton._instances in test setup is benign today but fragile under parallel test execution. |
Sequence Diagram
sequenceDiagram
participant T1 as Thread 1
participant T2 as Thread 2
participant S as Singleton.__call__
participant R as Registry (RLock)
Note over T1,T2: Singleton double-checked locking
T1->>S: cls not in _instances? (no lock)
T2->>S: cls not in _instances? (no lock)
T1->>S: acquire _lock (Lock)
T2-->>S: block on _lock
S->>S: cls not in _instances? (2nd check)
S->>S: _instances[cls] = super().__call__()
T1->>S: release _lock
T2->>S: acquire _lock
S->>S: cls already in _instances - skip
T2->>S: release _lock
T1-->>T1: return existing instance
T2-->>T2: return existing instance
Note over T1,T2: Registry.add concurrent path
T1->>R: acquire RLock
T2-->>R: block on RLock
R->>R: check name not in items
R->>R: validate(name, item)
R->>R: items[name] = item
T1->>R: release RLock
T2->>R: acquire RLock
R->>R: name in items - raise ValueError
T2->>R: release RLock
Note over T1,T2: Registry.__iter__ snapshot
T1->>R: acquire RLock - iter(list(items)) - release
T2->>R: acquire RLock - add new item - release
T1-->>T1: iterates stable snapshot (no RuntimeError)
Prompt To Fix All With AI
This is a comment left during a code review.
Path: nemoguardrails/singleton.py
Line: 28-35
Comment:
**Shared non-reentrant lock risks deadlock on nested Singleton construction**
`_lock` is a single `threading.Lock()` shared across *all* Singleton subclasses. If any Singleton's `__init__` instantiates a second, not-yet-initialized Singleton (e.g., `ClassA.__init__` calls `ClassB()` for the first time), Thread A enters `Singleton.__call__` for `ClassA`, acquires `_lock`, then re-enters `Singleton.__call__` for `ClassB` and blocks on `_lock` — deadlock.
`Registry.__init__` does not have this problem today, but the invariant is invisible to future contributors. Using `threading.RLock()` would eliminate the hazard with no behavioural change in the current code.
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: tests/test_registry.py
Line: 100
Comment:
**Unguarded direct mutation of `Singleton._instances`**
`Singleton._instances.pop(_SlowSingleton, None)` mutates the shared global dict without acquiring `Singleton._lock`. In the current sequential test run this is harmless, but if tests are ever parallelised with `pytest-xdist` (even across classes in the same process), the unsynchronised mutation races with the outer `if cls not in cls._instances` read in `Singleton.__call__`, potentially resurrecting a supposed-singleton mid-construction. Wrapping the pop in `with Singleton._lock:` would keep the test consistent with the guarantees being validated.
How can I resolve this? If you propose a fix, please make it concise.Reviews (1): Last reviewed commit: "refactor(core): make Registry and Single..." | Re-trigger Greptile
| _lock = threading.Lock() | ||
|
|
||
| def __call__(cls, *args, **kwargs): | ||
| """Call method for the singleton metaclass.""" | ||
| if cls not in cls._instances: | ||
| cls._instances[cls] = super(Singleton, cls).__call__(*args, **kwargs) | ||
| with cls._lock: | ||
| if cls not in cls._instances: | ||
| cls._instances[cls] = super().__call__(*args, **kwargs) | ||
| return cls._instances[cls] |
There was a problem hiding this comment.
Shared non-reentrant lock risks deadlock on nested Singleton construction
_lock is a single threading.Lock() shared across all Singleton subclasses. If any Singleton's __init__ instantiates a second, not-yet-initialized Singleton (e.g., ClassA.__init__ calls ClassB() for the first time), Thread A enters Singleton.__call__ for ClassA, acquires _lock, then re-enters Singleton.__call__ for ClassB and blocks on _lock — deadlock.
Registry.__init__ does not have this problem today, but the invariant is invisible to future contributors. Using threading.RLock() would eliminate the hazard with no behavioural change in the current code.
Prompt To Fix With AI
This is a comment left during a code review.
Path: nemoguardrails/singleton.py
Line: 28-35
Comment:
**Shared non-reentrant lock risks deadlock on nested Singleton construction**
`_lock` is a single `threading.Lock()` shared across *all* Singleton subclasses. If any Singleton's `__init__` instantiates a second, not-yet-initialized Singleton (e.g., `ClassA.__init__` calls `ClassB()` for the first time), Thread A enters `Singleton.__call__` for `ClassA`, acquires `_lock`, then re-enters `Singleton.__call__` for `ClassB` and blocks on `_lock` — deadlock.
`Registry.__init__` does not have this problem today, but the invariant is invisible to future contributors. Using `threading.RLock()` would eliminate the hazard with no behavioural change in the current code.
How can I resolve this? If you propose a fix, please make it concise.| time.sleep(0.01) | ||
| type(self).construction_count += 1 | ||
|
|
||
| Singleton._instances.pop(_SlowSingleton, None) |
There was a problem hiding this comment.
Unguarded direct mutation of
Singleton._instances
Singleton._instances.pop(_SlowSingleton, None) mutates the shared global dict without acquiring Singleton._lock. In the current sequential test run this is harmless, but if tests are ever parallelised with pytest-xdist (even across classes in the same process), the unsynchronised mutation races with the outer if cls not in cls._instances read in Singleton.__call__, potentially resurrecting a supposed-singleton mid-construction. Wrapping the pop in with Singleton._lock: would keep the test consistent with the guarantees being validated.
Prompt To Fix With AI
This is a comment left during a code review.
Path: tests/test_registry.py
Line: 100
Comment:
**Unguarded direct mutation of `Singleton._instances`**
`Singleton._instances.pop(_SlowSingleton, None)` mutates the shared global dict without acquiring `Singleton._lock`. In the current sequential test run this is harmless, but if tests are ever parallelised with `pytest-xdist` (even across classes in the same process), the unsynchronised mutation races with the outer `if cls not in cls._instances` read in `Singleton.__call__`, potentially resurrecting a supposed-singleton mid-construction. Wrapping the pop in `with Singleton._lock:` would keep the test consistent with the guarantees being validated.
How can I resolve this? If you propose a fix, please make it concise.
📝 WalkthroughWalkthroughThe changes add thread-safety to the Changes
Sequence Diagram(s)sequenceDiagram
participant T1 as Thread 1
participant T2 as Thread 2
participant SM as Singleton Metaclass
participant Lock
participant Cache
T1->>SM: __call__() - create instance
T2->>SM: __call__() - create instance
Note over T1,T2: First check (no lock)
T1->>Cache: _instances[cls] exists?
Cache-->>T1: No
T2->>Cache: _instances[cls] exists?
Cache-->>T2: No
Note over T1,T2: Both proceed to acquire lock
T1->>Lock: acquire()
T1->>Lock: ✓ locked
T2->>Lock: acquire()
T2->>Lock: waiting...
Note over T1: Second check (with lock held)
T1->>Cache: _instances[cls] exists?
Cache-->>T1: No
T1->>Cache: create & store instance
T1->>Lock: release()
T2->>Lock: ✓ acquired
Note over T2: Second check (with lock held)
T2->>Cache: _instances[cls] exists?
Cache-->>T2: Yes (created by T1)
T2->>Lock: release()
T1-->>T1: return instance
T2-->>T2: return same instance
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 5 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
tests/test_registry.py (1)
164-170: Broad exception catch is acceptable here, but consider narrowing for clarity.The static analysis warning about catching
Exceptionis technically valid, but in this stress test context, catching broadly is reasonable since we want to detect any failure mode. If you prefer to silence the warning while being more explicit about intent:♻️ Optional: Narrow the exception type
def reader(): try: for _ in range(100): for _name in registry: pass - except Exception as e: + except RuntimeError as e: + # Catches "dictionary changed size during iteration" errors.append(e)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_registry.py` around lines 164 - 170, The test's reader() function currently catches Exception broadly which triggers static-analysis warnings; update reader to either narrow the caught type to a more specific exception you expect (e.g., RuntimeError or KeyError) or explicitly document intent by keeping Exception but adding a suppression/comment (e.g., "# noqa: B110" or a brief comment) so linters know this broad catch is intentional; reference the reader function, the registry iteration, and the errors list to implement the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@tests/test_registry.py`:
- Around line 164-170: The test's reader() function currently catches Exception
broadly which triggers static-analysis warnings; update reader to either narrow
the caught type to a more specific exception you expect (e.g., RuntimeError or
KeyError) or explicitly document intent by keeping Exception but adding a
suppression/comment (e.g., "# noqa: B110" or a brief comment) so linters know
this broad catch is intentional; reference the reader function, the registry
iteration, and the errors list to implement the change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 0fa7281e-e122-4d02-9b70-a7cf3c02a1ee
📒 Files selected for processing (3)
nemoguardrails/registry.pynemoguardrails/singleton.pytests/test_registry.py
a6be550 to
c69efe5
Compare
Description
SingletonandRegistryhave TOCTOU (time-of-check-time-of-use) races.Singleton.__call__: two threads racingMyRegistry()can both see the class missing from_instancesand both construct an instance thus producing multiple "singleton" objects. Only one wins the final assignment; work already in flight on the loser holds a stale reference.Registry.add: two threads racingreg.add("x", ...)can both see"x" not in itemsand both assign so silently accepting duplicate registrations instead of raisingValueError.Registryiteration: a reader iteratingfor name in reg:while another thread callsreg.add(...)raisesRuntimeError: dictionary changed size during iteration.Impact: duplicate httpx connection pools in
LLMFrameworkRegistry, silently lost registrations, crashes in concurrent read/write paths. Affects every consumer of this shared infrastructure —LogAdapterRegistry,EmbeddingProviderRegistry, and the incomingLLMFrameworkRegistry.Fix
Singleton: double checked locking on__call__using a sharedthreading.Lock.Registry: instance levelthreading.RLockguardsadd,get,reset, and the read paths (list,__len__,__iter__,__contains__,__repr__).__iter__returns a snapshot (iter(list(self.items))) so iteration is stable even if writers mutates concurrently.Test plan
Three new tests in
tests/test_registry.py::TestThreadSafety: Each test uses a slow path in the vulnerable critical section (sleep in__init__/validate) to widen the race window so the bug is deterministically reproducible.Before the fix
Reverting the production code and running the new tests (10 consecutive runs):
After the fix
Consistent across 5 consecutive runs.
Summary by CodeRabbit
Bug Fixes
Tests