Skip to content

nimble/phy/nrf5x: fix race conditions in TIMER0 setup #2036

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

jwise
Copy link

@jwise jwise commented May 16, 2025

In the nRF5x LL PHY, we use TIMER0 to trigger various Tx/Rx sequencing tasks. The PHY attempted to recover from some cases in which it was 'late' setting up TIMER0; in theory, these ought not have happened, but if interrupt priorities do not provide absolute priority for the LL (for instance, in RTOS systems that have other priorities, or scheduling systems that have incompatible interrupt priority in order to call into RTOS queue management routines), then the PHY can indeed end up failing to schedule a transmit / receive window in time.

In some cases, the recovery process for these had race conditions, which could result in the LL state machine locking up (the end-of-tx interrupt would never fire). One race condition was extremely short (a few instructions wide; if the timer was set, but the timer triggered before the appropriate PPI could be configured); the other race condition was a microsecond wide (comparison to see if the timer had already triggered needed the current time to be greater in order to fail, but if it triggered within that microsecond, the current time would be equal). There are also other potential issues around timer event reset and check that we did not replicate, but did notice by inspection.

These cases appeared in a few different forms, so we factor out a "safe reset" routine to ensure that a timer will not trigger (to be used before setting up a PPI), and then we also factor out a routine to check whether a timer has been missed.

This change is verified to fix some observed stability issues in PebbleOS under heavy load.

cc: @gmarull @crc32 @sjp4 @ericmigi

@github-actions github-actions bot added the size/S Small PR label May 16, 2025
@sjanc sjanc requested a review from andrzej-kaczmarek May 16, 2025 06:43
Copy link
Contributor

@andrzej-kaczmarek andrzej-kaczmarek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, few style comments

* capture to it), and then clearing the event if it was set.
*/
static void
timer0_safely_reset(int timer)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

int cc

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bumped in 63e3946.

* to a scratch counter and comparing with wraparound.
*/
static bool
timer0_did_miss(int timer, int scratch)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

int cc, int scratch_cc

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bumped in 63e3946.

{
nrf_timer_task_trigger(NRF_TIMER0, nrf_timer_capture_task_get(scratch));
uint32_t ticks_rem = NRF_TIMER0->CC[timer] - NRF_TIMER0->CC[scratch];
return (ticks_rem >= 0x80000000) && !NRF_TIMER0->EVENTS_COMPARE[timer];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can simply compare NRF_TIMER0->CC[timer] < NRF_TIMER0->CC[scratch]

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, this is a fix for the bug that I originally suspected I was hitting, but in the end I was not :-)

NRF_TIMER0->CC is a uint32. If scratch_cc wrapped but cc did not -- i.e., NRF_TIMER0->CC[cc] is 0xFFFF_FFFE but NRF_TIMER0->CC[scratch_cc] is 0x0000_0002, then it's not true that NRF_TIMER0->CC[cc] < NRF_TIMER0->CC[scratch_cc], but it is true that (NRF_TIMER0->CC[cc] - NRF_TIMER0->CC[scratch_cc]) >= 0x80000000.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIMER0 is reset on each event so in practice it cannot wrap around during an event, unless there's some scenario I'm not aware of.

@jwise
Copy link
Author

jwise commented May 26, 2025

Thanks! I'll make these changes tomorrow -- it's a holiday in the US today.

In the nRF5x LL PHY, we use TIMER0 to trigger various Tx/Rx sequencing
tasks.  The PHY attempted to recover from some cases in which it was 'late'
setting up TIMER0; in theory, these ought not have happened, but if
interrupt priorities do not provide absolute priority for the LL (for
instance, in RTOS systems that have other priorities, or scheduling systems
that have incompatible interrupt priority in order to call into RTOS queue
management routines), then the PHY can indeed end up failing to schedule a
transmit / receive window in time.

In some cases, the recovery process for these had race conditions, which
could result in the LL state machine locking up (the end-of-tx interrupt
would never fire).  One race condition was extremely short (a few
instructions wide; if the timer was set, but the timer triggered before the
appropriate PPI could be configured); the other race condition was a
microsecond wide (comparison to see if the timer had already triggered
needed the current time to be greater in order to fail, but if it triggered
within that microsecond, the current time would be equal).  There are also
other potential issues around timer event reset and check that we did not
replicate, but did notice by inspection.

These cases appeared in a few different forms, so we factor out a "safe
reset" routine to ensure that a timer will not trigger (to be used before
setting up a PPI), and then we also factor out a routine to check whether a
timer has been missed.

This change is verified to fix some observed stability issues in PebbleOS
under heavy load.

Signed-off-by: Joshua Wise <[email protected]>
@jwise jwise force-pushed the jwise/fix-nrf5x-race-conditions branch from 1cf34c8 to 63e3946 Compare May 27, 2025 23:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size/S Small PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants