[BugFix] race condition [is_fetching] causing multiple fetch requests #5238
base: develop
Conversation
Thanks for your contribution!
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

    @@           Coverage Diff            @@
    ##             develop    #5238   +/- ##
    ==========================================
      Coverage          ?    60.53%
    ==========================================
      Files             ?       320
      Lines             ?     39054
      Branches          ?      5871
    ==========================================
      Hits              ?     23642
      Misses            ?     13549
      Partials          ?      1863
|
juncaipeng left a comment:
LGTM
Pull request overview
This PR addresses a race condition in the request scheduling logic where multiple fetch requests could be submitted concurrently, because the `is_fetching` flag was set inside the worker thread instead of before task submission. The fix moves the flag assignment to immediately before the `submit()` call, eliminating the race window.
Key changes:
- Fixed the race condition by setting `is_fetching = True` before `submit()` instead of inside the worker thread
- Corrected the port type from `str` to `int` in the router configuration for type consistency
- Removed a duplicate `paddle.to_tensor()` call in the resource manager
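For illustration only, here is a minimal, self-contained sketch of the fixed ordering. The names (`is_fetching`, `_fetch_request`, the pool) mirror the PR description rather than the actual FastDeploy code. Setting the flag synchronously in the scheduler loop, before `submit()`, guarantees that a later loop iteration cannot slip past the `if not is_fetching` check while the submitted worker is still waiting to run:

```python
from concurrent.futures import ThreadPoolExecutor
import threading
import time

pool = ThreadPoolExecutor(max_workers=4)
is_fetching = False
fetch_count = 0
count_lock = threading.Lock()

def _fetch_request():
    """Worker: does the (simulated) fetch, then clears the flag."""
    global is_fetching, fetch_count
    with count_lock:
        fetch_count += 1
    time.sleep(0.05)          # simulate fetch latency
    is_fetching = False       # cleared only once the fetch completes

def schedule_once():
    """One scheduler-loop iteration: submit a fetch only if none is in flight."""
    global is_fetching
    if not is_fetching:
        # Setting the flag BEFORE submit() closes the race window: with the
        # buggy ordering (flag set inside _fetch_request), another iteration
        # could pass the check above before the worker thread ever ran.
        is_fetching = True
        pool.submit(_fetch_request)

for _ in range(10):           # ten rapid scheduler iterations
    schedule_once()
pool.shutdown(wait=True)
print(fetch_count)            # only one fetch was submitted
```

With the buggy ordering, all ten iterations could pass the check before the single queued worker thread set the flag, submitting up to ten concurrent fetches.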
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| fastdeploy/router/router.py | Changed port parameter type from `str` to `int` for type correctness |
| fastdeploy/engine/sched/resource_manager_v1.py | Removed duplicate `paddle.to_tensor()` call that was creating the same tensor twice |
| fastdeploy/engine/common_engine.py | Fixed race condition by moving the `is_fetching = True` assignment from inside `_fetch_request()` to before thread pool submission, preventing multiple concurrent fetch requests. Applied the fix to both "mixed" and non-"mixed" execution paths |
Comments suppressed due to low confidence (1)
fastdeploy/engine/common_engine.py:817
- Critical bug: If `get_request_pool.submit(_fetch_request)` raises a `RuntimeError` after `is_fetching = True` is set, the exception handler at lines 812-817 will either break or re-raise the exception without resetting `is_fetching` back to `False`. This will permanently block all future fetch request submissions. You need to reset `is_fetching = False` before the `break` statement (line 815) and before the `raise` statement (line 817) to prevent the flag from being stuck in the `True` state.
    is_fetching = True
    get_request_pool.submit(_fetch_request)
except RuntimeError as e:
    if "shutdown" in str(e):
        llm_logger.info("Thread pool shutdown detected, exiting scheduler loop")
        break
    else:
        raise
    is_fetching = True
    get_request_pool.submit(_fetch_request)
Copilot AI commented on Nov 27, 2025:
Critical bug: If `get_request_pool.submit(_fetch_request)` at line 800 throws an exception after `is_fetching = True` is set at line 799, the exception will be caught by the general exception handler (lines 859-861 in the full context), which logs the error but doesn't reset `is_fetching` back to `False`. This will permanently block all future fetch request submissions. Consider wrapping lines 799-800 in a try-except block that ensures `is_fetching` is reset to `False` if the submission fails, similar to how the "mixed" role path handles this at lines 809-817.
Suggested change: replace

    is_fetching = True
    get_request_pool.submit(_fetch_request)

with

    try:
        is_fetching = True
        get_request_pool.submit(_fetch_request)
    except RuntimeError as e:
        is_fetching = False
        if "shutdown" in str(e):
            llm_logger.info("Thread pool shutdown detected, exiting scheduler loop")
            break
        else:
            self.llm_logger.error(f"fetching request error {e} {str(traceback.format_exc())}")
    except Exception as e:
        is_fetching = False
        self.llm_logger.error(f"fetching request error {e} {str(traceback.format_exc())}")
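The reviewer's concern reduces to a small runnable sketch (hypothetical names; the real handler also logs and breaks out of the scheduler loop): `ThreadPoolExecutor.submit()` raises `RuntimeError` once the pool has been shut down, and without resetting the flag inside the handler, `is_fetching` would stay `True` forever and block every later submission attempt:

```python
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=1)
pool.shutdown(wait=True)   # a shut-down pool makes submit() raise RuntimeError

is_fetching = False

def _fetch_request():
    pass  # stand-in for the real fetch worker

def try_submit():
    """Submit a fetch, resetting the in-flight flag if submission fails."""
    global is_fetching
    try:
        is_fetching = True
        pool.submit(_fetch_request)
        return True
    except RuntimeError as e:
        # Reset the flag so a later loop iteration is not blocked forever.
        is_fetching = False
        if "shutdown" in str(e):
            return False   # the scheduler loop would `break` here
        raise

print(try_submit(), is_fetching)   # submission fails, flag is reset
```

The same reasoning applies to the broad `except Exception` path: any handler that can swallow a failed `submit()` must also clear `is_fetching`, otherwise the single stuck flag disables fetching for the lifetime of the process.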
Motivation
Modifications
- Fixed the port type error that caused a traceback in router.log
- Fixed the race condition on `is_fetching` that caused multiple fetch requests
Usage or Command
    bash start_v1_tp1.sh
Accuracy Tests
Checklist
- [ ] Choose a PR tag from: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- [ ] Run `pre-commit` before commit.
- [ ] If the PR is for the `release` branch, make sure the PR has been submitted to the `develop` branch first, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.