Developer notes for implemented features #739
Replies: 4 comments
-
**#696 fix: added negotiate timeout to Muxer/Multistream**

**Resolved Problem**

The core issue was that during multistream-select protocol negotiations, read operations like the ones in `handshake` could hang indefinitely if remote peers never responded or stalled without closing the stream:

```python
async def handshake(self, communicator: IMultiselectCommunicator) -> None:
    """
    Perform handshake to agree on multiselect protocol.

    :param communicator: communicator to use
    :raise MultiselectError: raised when handshake failed
    """
    try:
        await communicator.write(MULTISELECT_PROTOCOL_ID)
    except MultiselectCommunicatorError as error:
        raise MultiselectError() from error

    try:
        handshake_contents = await communicator.read()
    except MultiselectCommunicatorError as error:
        raise MultiselectError() from error

    if not is_valid_handshake(handshake_contents):
        raise MultiselectError(
            "multiselect protocol ID mismatch: "
            f"received handshake_contents={handshake_contents}"
        )
```

A hung read could stall the entire connection negotiation pipeline, blocking the event loop and delaying peer discovery. The problem occurred in three critical functions that read from partially established p2p connections.
**Implementation Approach**

Two architectural approaches were considered:

- Option 1: add timeout wrappers at the two call sites.
- Option 2: implement timeout logic directly inside the read functions themselves.

Decision: chose Option 2 for better encapsulation and consistency. This guarantees timeout behavior regardless of where these functions are called, and avoids code duplication.

**Technical Implementation**

Core timeout implementation, using `trio.fail_after`:

```python
try:
    with trio.fail_after(DEFAULT_NEGOTIATE_TIMEOUT):  # 5 seconds
        response = await communicator.read()
except trio.TooSlowError:
    raise MultiselectClientError("protocol selection response timed out")
except MultiselectCommunicatorError as error:
    raise MultiselectClientError() from error
```

**Modified Functions**
**Configuration Management**

Made the negotiate timeout configurable via the `DEFAULT_NEGOTIATE_TIMEOUT` constant (5 seconds by default) rather than hard-coding it at each read site.
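The PR implements this with `trio.fail_after`; the same configurable-timeout shape can be sketched with stdlib `asyncio` so it runs without dependencies. All names below (`read_with_timeout`, `NegotiateTimeoutError`, `prompt_read`) are illustrative stand-ins, not py-libp2p's API:

```python
import asyncio

DEFAULT_NEGOTIATE_TIMEOUT = 5  # seconds, matching the PR's default

class NegotiateTimeoutError(Exception):
    """Illustrative stand-in for MultiselectError / MultiselectClientError."""

async def read_with_timeout(read, timeout: float = DEFAULT_NEGOTIATE_TIMEOUT) -> str:
    """Wrap a negotiation read so a stalled peer fails fast instead of hanging."""
    try:
        return await asyncio.wait_for(read(), timeout)
    except asyncio.TimeoutError:
        raise NegotiateTimeoutError("protocol selection response timed out")

async def main() -> str:
    async def prompt_read() -> str:
        # A well-behaved peer answers immediately.
        return "/multistream/1.0.0"

    # Callers can override the module-level default per call site.
    return await read_with_timeout(prompt_read, timeout=0.5)

print(asyncio.run(main()))  # /multistream/1.0.0
```

Defaulting the keyword argument to a module-level constant keeps the common path zero-config while still letting individual callers tighten or loosen the deadline.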
**Testing Strategy**

Challenge: simulating timeout conditions.

Solution: a mock-communicator approach whose operations never complete:

```python
class DummyMultiselectCommunicator(IMultiselectCommunicator):
    async def write(self, msg_str: str) -> None:
        """Goes into infinite loop when .write is called"""
        await trio.sleep_forever()

    async def read(self) -> str:
        """Returns a dummy read"""
        return "dummy_read"
```

**Test Coverage**

Created comprehensive tests for all three modified functions.
Each test verifies that the appropriate timeout exception is raised within the expected timeframe.

**Results**

This implementation resolves the negotiate timeout issue: a stalled or unresponsive peer can no longer hang protocol negotiation indefinitely.
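The test idea can be sketched in a self-contained form: pair a communicator that never answers with a timeout-wrapped read and assert the failure path. The real tests use trio and `DummyMultiselectCommunicator`; this version uses stdlib `asyncio` and a deliberately short timeout so it finishes quickly, and `StalledCommunicator` is a hypothetical name:

```python
import asyncio

class StalledCommunicator:
    """Like DummyMultiselectCommunicator's write(): an operation that never completes."""

    async def read(self) -> str:
        await asyncio.Event().wait()  # no one ever sets the event
        return "unreachable"

async def main() -> str:
    comm = StalledCommunicator()
    try:
        # The real code waits DEFAULT_NEGOTIATE_TIMEOUT (5 s); a short
        # timeout keeps the test fast.
        await asyncio.wait_for(comm.read(), timeout=0.05)
        return "no timeout"
    except asyncio.TimeoutError:
        return "timed out"

print(asyncio.run(main()))  # timed out
```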
-
**#708 TODO: Adding Concurrency Cap to Identify Push Handling**

Implements a concurrency cap for identify push handling.

**Problem**

The core issue was that `push_identify_to_peers` used an unbounded trio nursery:

```python
async def push_identify_to_peers(
    host: IHost,
    peer_ids: set[ID] | None = None,
    observed_multiaddr: Multiaddr | None = None,
) -> None:
    """
    Push an identify message to multiple peers in parallel.

    If peer_ids is None, push to all connected peers.
    """
    if peer_ids is None:
        # Get all connected peers
        peer_ids = set(host.get_peerstore().peer_ids())

    async with trio.open_nursery() as nursery:
        for peer_id in peer_ids:
            nursery.start_soon(
                push_identify_to_peer, host, peer_id, observed_multiaddr
            )
```

If a node was connected to 500 or 1000 peers, it would attempt to push identify information to all of them simultaneously, with no throttling mechanism. This unbounded concurrency caused several critical problems at scale.
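The fan-out above can be reproduced with a dependency-free sketch (stdlib `asyncio` instead of the project's trio; all names illustrative). With 500 simulated peers, every push ends up in flight at once:

```python
import asyncio

async def main() -> int:
    counter = {"current": 0, "max": 0}

    async def push(peer_id: int) -> None:
        # Single-threaded event loop: plain increments are race-free here.
        counter["current"] += 1
        counter["max"] = max(counter["max"], counter["current"])
        await asyncio.sleep(0.1)  # stand-in for a network round-trip
        counter["current"] -= 1

    # One task per "peer", launched with no throttling, like the nursery above.
    await asyncio.gather(*(push(i) for i in range(500)))
    return counter["max"]

print(asyncio.run(main()))  # 500: every push runs concurrently
```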
The lack of throttling meant the system would "fire everything off as fast as it can," creating potential network congestion and overwhelming both local resources and remote peers.

**Implementation Approach**

The solution implements a semaphore-based throttling mechanism using `trio.Semaphore`.

**Strategy**

```python
CONCURRENCY_LIMIT = 10  # Default limit
semaphore = trio.Semaphore(CONCURRENCY_LIMIT)

async def limited_push(peer_id: ID) -> None:
    async with semaphore:
        await push_identify_to_peer(host, peer_id, observed_multiaddr)
```

This ensures that only 10 identify pushes run concurrently at any given time while the rest wait their turn.

**Technical Details**
An instrumented variant threads an optional counter and lock through `limited_push` so tests can observe the peak number of concurrent pushes:

```python
async def limited_push(peer_id: ID) -> None:
    async with semaphore:
        # Track current/peak concurrency when instrumentation is supplied.
        if counter is not None and lock is not None:
            async with lock:
                counter["current"] += 1
                counter["max"] = max(counter["max"], counter["current"])
        try:
            await push_identify_to_peer(host, peer_id, observed_multiaddr)
        finally:
            if counter is not None and lock is not None:
                async with lock:
                    counter["current"] -= 1
```
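The cap can be verified with the same peak-concurrency instrumentation in a dependency-free sketch (stdlib `asyncio` rather than trio; because the asyncio loop is single-threaded, a bare dict stands in for the counter-plus-lock pair above):

```python
import asyncio

CONCURRENCY_LIMIT = 10

async def main() -> int:
    semaphore = asyncio.Semaphore(CONCURRENCY_LIMIT)
    counter = {"current": 0, "max": 0}

    async def limited_push(peer_id: int) -> None:
        async with semaphore:
            counter["current"] += 1
            counter["max"] = max(counter["max"], counter["current"])
            try:
                await asyncio.sleep(0.01)  # stand-in for push_identify_to_peer
            finally:
                counter["current"] -= 1

    # 100 simulated peers contend for 10 semaphore slots.
    await asyncio.gather(*(limited_push(i) for i in range(100)))
    return counter["max"]

print(asyncio.run(main()))  # 10: never more than CONCURRENCY_LIMIT in flight
```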
**Testing Strategy**

**Results**

**Immediate Benefits**

**Long-term Advantages**
-
**#648 Matching py-libp2p <-> go-libp2p PeerStore Implementation**

Resolved issue #251.

**Problem**

Before this PR, py-libp2p shipped only a skeletal PeerStore.

**Implementation Details**

**Key Design Decisions**
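For a flavor of what matching go-libp2p involves: go-libp2p's peerstore keeps addresses with per-entry TTLs that expire automatically. A minimal illustration of that TTL idea, using a toy class (`AddrBookSketch` and its methods are invented for this sketch, not py-libp2p's actual classes):

```python
import time

class AddrBookSketch:
    """Toy TTL-based address book; illustrative only."""

    def __init__(self) -> None:
        # peer_id -> {multiaddr string: expiry timestamp}
        self._addrs: dict[str, dict[str, float]] = {}

    def add_addr(self, peer_id: str, addr: str, ttl: float) -> None:
        """Record an address that stays valid for `ttl` seconds."""
        self._addrs.setdefault(peer_id, {})[addr] = time.monotonic() + ttl

    def addrs(self, peer_id: str) -> list[str]:
        """Return live addresses, dropping expired ones lazily on read."""
        now = time.monotonic()
        live = {
            a: exp
            for a, exp in self._addrs.get(peer_id, {}).items()
            if exp > now
        }
        self._addrs[peer_id] = live
        return list(live)

book = AddrBookSketch()
book.add_addr("QmPeer", "/ip4/127.0.0.1/tcp/4001", ttl=60.0)
print(book.addrs("QmPeer"))  # ['/ip4/127.0.0.1/tcp/4001']
```

Lazy expiry on read keeps writes O(1) and avoids a background sweeper; a production store may also garbage-collect periodically.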
-
**#753 Certified-Addr-Book interface for Peer-Store**

This post describes how signed-peer-record transfer is integrated into the Identify protocol.

In this PR, an optional certified-addr-book interface is added to the peer store. The signed peer record is attached while creating the identify message to be sent, and on the receiving side the record is consumed into the peer store.

The peer-record transfer can be tried out with the identify-demo example.

This is how the peer-record transfer has been integrated with the Identify protocol.
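The "certified" part boils down to sign-then-verify: a peer's addresses are only trusted if the attached signature checks out. Real signed peer records use the peer's public/private key pair and libp2p's envelope format; this toy uses stdlib HMAC (a symmetric stand-in, with invented function names) purely to show the shape of the check:

```python
import hashlib
import hmac

# Symmetric stand-in for the peer's signing key; real envelopes use
# public-key signatures so anyone can verify.
SECRET = b"peer-signing-key-stand-in"

def sign_record(peer_id: str, addrs: list[str]) -> tuple[bytes, bytes]:
    """Serialize a toy peer record and produce its signature."""
    payload = "|".join([peer_id, *addrs]).encode()
    return payload, hmac.new(SECRET, payload, hashlib.sha256).digest()

def verify_record(payload: bytes, sig: bytes) -> bool:
    """Only records whose signature verifies should enter the addr book."""
    expected = hmac.new(SECRET, payload, hashlib.sha256).digest()
    return hmac.compare_digest(sig, expected)

payload, sig = sign_record("QmPeer", ["/ip4/127.0.0.1/tcp/4001"])
print(verify_record(payload, sig))         # True
print(verify_record(payload + b"!", sig))  # False: tampered record rejected
```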
-
🛠️ Developer Notes for Implemented Features
This thread serves as a shared space for py-libp2p contributors to leave notes, insights, and context about the issues, features, and fixes they work on.
Whether it's:
A workaround for an unexpected bug 🐛
A design decision you debated 💭
A test setup that helped debug a tricky scenario 🧪
Or just the steps you followed to resolve a TODO item ✅
Please document it here. Think of this as a living journal of the dev experience — your notes might save someone else hours of head-scratching.
📚 Why this matters
It helps newcomers learn how to approach issues in this codebase
It builds a transparent understanding of why certain changes were made
It captures tribal knowledge that usually gets lost in commit messages
📝 How to contribute
When you finish a PR or solve an issue:
Briefly summarize the problem and your approach
Mention any gotchas or alternative paths you explored
Link to the relevant PR/issue