-
Notifications
You must be signed in to change notification settings - Fork 438
nimble/phy/nrf5x: fix race conditions in TIMER0 setup #2036
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
nimble/phy/nrf5x: fix race conditions in TIMER0 setup #2036
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good, few style comments
nimble/drivers/nrf5x/src/ble_phy.c
Outdated
* capture to it), and then clearing the event if it was set. | ||
*/ | ||
static void | ||
timer0_safely_reset(int timer) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
int cc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bumped in 63e3946.
nimble/drivers/nrf5x/src/ble_phy.c
Outdated
* to a scratch counter and comparing with wraparound. | ||
*/ | ||
static bool | ||
timer0_did_miss(int timer, int scratch) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
int cc, int scratch_cc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bumped in 63e3946.
nimble/drivers/nrf5x/src/ble_phy.c
Outdated
{ | ||
nrf_timer_task_trigger(NRF_TIMER0, nrf_timer_capture_task_get(scratch)); | ||
uint32_t ticks_rem = NRF_TIMER0->CC[timer] - NRF_TIMER0->CC[scratch]; | ||
return (ticks_rem >= 0x80000000) && !NRF_TIMER0->EVENTS_COMPARE[timer]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can simply compare NRF_TIMER0->CC[timer] < NRF_TIMER0->CC[scratch]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, this is a fix for the bug that I originally suspected I was hitting, but in the end I was not :-)
NRF_TIMER0->CC
is a uint32
. If scratch_cc
wrapped but cc
did not -- i.e., NRF_TIMER0->CC[cc]
is 0xFFFF_FFFE
but NRF_TIMER0->CC[scratch_cc]
is 0x0000_0002
, then it's not true that NRF_TIMER0->CC[cc] < NRF_TIMER0->CC[scratch_cc]
, but it is true that (NRF_TIMER0->CC[cc] - NRF_TIMER0->CC[scratch_cc]) >= 0x80000000
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIMER0
is reset on each event so in practice it cannot wrap around during an event, unless there's some scenario I'm not aware of.
Thanks! I'll make these changes tomorrow -- it's a holiday in the US today. |
In the nRF5x LL PHY, we use TIMER0 to trigger various Tx/Rx sequencing tasks. The PHY attempted to recover from some cases in which it was 'late' setting up TIMER0; in theory, these ought not have happened, but if interrupt priorities do not provide absolute priority for the LL (for instance, in RTOS systems that have other priorities, or scheduling systems that have incompatible interrupt priority in order to call into RTOS queue management routines), then the PHY can indeed end up failing to schedule a transmit / receive window in time. In some cases, the recovery process for these had race conditions, which could result in the LL state machine locking up (the end-of-tx interrupt would never fire). One race condition was extremely short (a few instructions wide; if the timer was set, but the timer triggered before the appropriate PPI could be configured); the other race condition was a microsecond wide (comparison to see if the timer had already triggered needed the current time to be greater in order to fail, but if it triggered within that microsecond, the current time would be equal). There are also other potential issues around timer event reset and check that we did not replicate, but did notice by inspection. These cases appeared in a few different forms, so we factor out a "safe reset" routine to ensure that a timer will not trigger (to be used before setting up a PPI), and then we also factor out a routine to check whether a timer has been missed. This change is verified to fix some observed stability issues in PebbleOS under heavy load. Signed-off-by: Joshua Wise <[email protected]>
1cf34c8
to
63e3946
Compare
In the nRF5x LL PHY, we use TIMER0 to trigger various Tx/Rx sequencing tasks. The PHY attempted to recover from some cases in which it was 'late' setting up TIMER0; in theory, these ought not have happened, but if interrupt priorities do not provide absolute priority for the LL (for instance, in RTOS systems that have other priorities, or scheduling systems that have incompatible interrupt priority in order to call into RTOS queue management routines), then the PHY can indeed end up failing to schedule a transmit / receive window in time.
In some cases, the recovery process for these had race conditions, which could result in the LL state machine locking up (the end-of-tx interrupt would never fire). One race condition was extremely short (a few instructions wide; if the timer was set, but the timer triggered before the appropriate PPI could be configured); the other race condition was a microsecond wide (comparison to see if the timer had already triggered needed the current time to be greater in order to fail, but if it triggered within that microsecond, the current time would be equal). There are also other potential issues around timer event reset and check that we did not replicate, but did notice by inspection.
These cases appeared in a few different forms, so we factor out a "safe reset" routine to ensure that a timer will not trigger (to be used before setting up a PPI), and then we also factor out a routine to check whether a timer has been missed.
This change is verified to fix some observed stability issues in PebbleOS under heavy load.
cc: @gmarull @crc32 @sjp4 @ericmigi