Handle Inconsistent Response in Miner Api #221
base: main
Conversation
I've set execution mode to fallback here, although I'm not convinced this is the best way to handle this situation. I think setting the health to partial content is probably sufficient. Let me know what you think!
8a2b679 to 6ea2e32
Since #201 was closed, I cherry-picked this back onto main.
src/proxy.rs
Outdated
    builder_client.forward(builder_req, builder_method),
    service.l2_client.forward(l2_req, method)
);
if builder_res.is_ok() != l2_res.is_ok() {
If both builder_res and l2_res are Err, shouldn't we still update execution mode and health?
No, because we can return that error to the caller, and they can handle it as they wish.
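A minimal sketch of the outcome handling being discussed; the names Outcome and classify are illustrative, not from this PR. Only a mismatch between the builder and L2 results flips execution mode or health, while a double failure is returned to the caller unchanged.

```rust
/// Illustrative only: the three cases discussed above.
enum Outcome {
    /// Both calls succeeded: nothing to do.
    Consistent,
    /// Both calls failed: propagate the error to the caller unchanged.
    BothFailed,
    /// Exactly one side failed: disable execution / degrade health.
    Mismatch,
}

fn classify<T, E>(builder_res: &Result<T, E>, l2_res: &Result<T, E>) -> Outcome {
    match (builder_res.is_ok(), l2_res.is_ok()) {
        (true, true) => Outcome::Consistent,
        (false, false) => Outcome::BothFailed,
        _ => Outcome::Mismatch,
    }
}
```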
src/proxy.rs
Outdated
let mut execution_mode = service.execution_mode.lock();
if *execution_mode == ExecutionMode::Enabled {
    *execution_mode = ExecutionMode::Disabled;
    // Drop before acquiring health lock
The health status here isn't sticky. It will switch back to Healthy on the next health check interval in the HealthHandle if the builder's unsafe head is up to date. So I'm not sure it's even worth updating the probes unless you can tell the HealthHandle not to update the health until the DA limits discrepancy is resolved.
Although it will likely trigger a conductor failover as is, so maybe it's worth it.
Ideally the health status here would stick until there is no discrepancy in DA limits between the sequencer and the builder.
I think letting health status revert should be okay, as long as a new leader election is triggered.
I will check in more detail next week, but can we add a unit test to check this behaviour?
@ferranbt Yeah, will definitely add tests; I was just waiting for us to settle on the solution. Just pushed a couple of commits, and I think this is currently as good as we can do. I added a retry layer to our HTTP client, which should hopefully prevent us from ever running into this situation.
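A minimal sketch of the kind of bounded retry described above; the helper name, attempt count, and 200ms backoff are illustrative assumptions, not the PR's actual retry layer.

```rust
use std::{future::Future, time::Duration};

/// Retry an async operation a fixed number of times with a short pause
/// between attempts. Panics if `attempts` is zero.
async fn with_retries<F, Fut, T, E>(mut op: F, attempts: usize) -> Result<T, E>
where
    F: FnMut() -> Fut,
    Fut: Future<Output = Result<T, E>>,
{
    let mut last_err = None;
    for _ in 0..attempts {
        match op().await {
            Ok(value) => return Ok(value),
            Err(e) => {
                // A tracing::warn! here would also cover the "error logging"
                // notes later in this thread.
                last_err = Some(e);
                tokio::time::sleep(Duration::from_millis(200)).await;
            }
        }
    }
    Err(last_err.expect("attempts must be non-zero"))
}
```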
};

let clone = manager.clone();
tokio::spawn(async move {
Consider extracting this future into a separate async function, something like async fn handle_requests.
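A sketch of the suggested extraction; RequestManager and the function body are placeholders, only the shape matters.

```rust
/// Placeholder standing in for the real manager type in this PR.
#[derive(Clone)]
struct RequestManager;

impl RequestManager {
    /// The body of the spawned `async move { ... }` block moves here.
    async fn handle_requests(self) {
        // forward requests, update execution mode, etc.
    }
}

fn spawn_handler(manager: &RequestManager) {
    let clone = manager.clone();
    tokio::spawn(clone.handle_requests());
}
```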
let parts_clone = parts.clone();
let body_clone = body.clone();
let res_sender_clone = res_tx.clone();
tokio::spawn(async move {
The result of this future is immediately awaited, so why are we spawning a task here?
I think I see - is it to isolate the inner part in case of a task abort?
If that's the case, maybe we should drop a comment explaining it? Or consider an explicit cancellation token.
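A sketch of the explicit-cancellation alternative using tokio_util's CancellationToken; this is one possible approach under that assumption, not code from the PR.

```rust
use tokio_util::sync::CancellationToken;

/// Run the inner work, but stop cooperatively when the parent cancels the
/// token instead of relying on an abrupt task abort.
async fn run_inner(token: CancellationToken) {
    tokio::select! {
        _ = token.cancelled() => {
            // parent requested shutdown; clean up gracefully here
        }
        _ = do_work() => {
            // inner work finished on its own
        }
    }
}

async fn do_work() {
    // the part that previously lived in the spawned task
}
```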
let mut manager = self.clone();
let parts_clone = parts.clone();
let body_clone = body.clone();
match tokio::spawn(
let has_disabled_execution_mode = Arc::new(AtomicBool::new(false));

let manager = Self {
Consider putting the manager inside an Arc and sharing the same instance across many tasks.
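A sketch of the sharing pattern being suggested; Manager here is a stand-in, not the PR's type.

```rust
use std::sync::Arc;

/// Stand-in for the real manager state.
#[derive(Default)]
struct Manager;

fn spawn_workers(manager: Arc<Manager>, workers: usize) {
    for _ in 0..workers {
        // Cheap pointer clone: every task sees the same Manager instance.
        let manager = Arc::clone(&manager);
        tokio::spawn(async move {
            let _shared = &*manager;
            // handle requests using the shared state
        });
    }
}
```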
let mut manager = self.clone();
let parts_clone = parts.clone();
let body_clone = body.clone();
let res_sender_clone = res_tx.clone();
Unnecessary clone? We could just move res_tx into the task.
Are there meant to be any assertions in this test? It would be nice to have it cover the case where the builder fails to respond to the call, but I'm happy to have that as a follow-up.
req_rx.mark_unchanged();
res_rx.mark_unchanged();

let has_disabled_execution_mode = Arc::new(AtomicBool::new(false));
why is this initially set to false? shouldn't it be derived from the execution mode?
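A sketch of deriving the initial flag from the current mode instead of hard-coding false; the two-variant ExecutionMode here is a simplification of whatever the real enum contains.

```rust
use std::sync::{atomic::AtomicBool, Arc};

/// Simplified stand-in for the real execution mode enum.
#[derive(PartialEq)]
enum ExecutionMode {
    Enabled,
    Disabled,
}

/// Initialise the flag from the mode the proxy actually starts in.
fn initial_disabled_flag(mode: &ExecutionMode) -> Arc<AtomicBool> {
    Arc::new(AtomicBool::new(*mode == ExecutionMode::Disabled))
}
```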
let l2_req = HttpRequest::from_parts(parts, HttpBody::from(body_bytes));
info!(target: "proxy::call", message = "forward request to default execution client", ?method);
service.l2_client.forward(l2_req, method).await
if method == MINER_SET_MAX_DA_SIZE {
It would be useful to add docs here explaining why MINER_SET_MAX_DA_SIZE is treated differently.
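One possible wording for the requested comment, based on the audit findings at the end of this thread; the exact phrasing is an assumption, not text from the PR.

```rust
// MINER_SET_MAX_DA_SIZE changes the DA limits used for block building. If
// the builder and the default L2 execution client end up with different
// limits, the builder can produce blocks the L2 client will reject, so this
// method needs stricter success/failure handling than the generic
// forwarding path.
if method == MINER_SET_MAX_DA_SIZE {
    // ...
}
```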
    Ok(())
}
Err(e) => {
    let msg = format!("failed to send request to l2: {e}");
Should self.probes.set_health be set to unhealthy here?
drop(execution_mode);
self.has_disabled_execution_mode
    .store(false, std::sync::atomic::Ordering::SeqCst);
info!(target: "proxy::call", message = "setting execution mode to Enabled");
Should self.probes.set_health be reset here if the call failed previously?
Err(e) => {
    let msg = format!("failed to send request to l2: {e}");
    res_tx.send(Some(Err(e)))?;
    Err(eyre!(msg))
error logging would be useful here
{
    Ok(_) => return Ok(()),
    Err(_) => {
        tokio::time::sleep(Duration::from_millis(200)).await;
error logging would be useful here
Audit findings:
Not all forwarded requests should be treated equally. Specifically, there are eth_ and miner_ requests. If for some reason a miner_ request succeeds on the L2 EL but fails on the builder (or vice versa), it can result in a discrepancy in information that is critical for consensus, such as GasPrice and GasLimit. This can lead to the builder creating a block with a higher gas limit than the L2 EL expects. As a result, the L2 EL will mark this block as invalid and potentially flag all future builder blocks as invalid as well, effectively disabling builder functionality.
Recommendation(s): Increase the robustness of the forwarding mechanism. A miner_ request should be forwarded to the builder only if the call to the L2 EL succeeds. If the builder fails to process the request, the failure should be logged. In addition, the system should set the execution mode to Disabled to prevent further issues.
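A minimal sketch of the recommended flow, with stand-in types (the real PR wires this through its own service and client types): the L2 EL is called first, the builder is only called afterwards, and a builder-side failure is logged and disables execution.

```rust
use std::sync::{Arc, Mutex};

use eyre::Result;

/// Stand-ins for the sketch only; the field names mirror the diff above.
#[derive(Clone, Copy, PartialEq)]
enum ExecutionMode {
    Enabled,
    Disabled,
}

#[derive(Clone)]
struct RpcClient;

impl RpcClient {
    async fn forward(&self, _req: &str) -> Result<String> {
        Ok("ok".to_string())
    }
}

struct ProxyService {
    l2_client: RpcClient,
    builder_client: RpcClient,
    execution_mode: Arc<Mutex<ExecutionMode>>,
}

impl ProxyService {
    /// Forward a miner_ call to the L2 EL first; mirror it to the builder
    /// only if the L2 EL accepted it. A builder failure is logged and the
    /// execution mode is set to Disabled so the builder cannot keep building
    /// with stale DA/gas limits.
    async fn forward_miner_call(&self, req: &str) -> Result<String> {
        let l2_res = self.l2_client.forward(req).await?;

        if let Err(e) = self.builder_client.forward(req).await {
            tracing::error!(target: "proxy::call", error = %e, "builder failed miner_ call");
            *self.execution_mode.lock().unwrap() = ExecutionMode::Disabled;
        }

        Ok(l2_res)
    }
}
```

With this ordering, an inconsistent miner_ state can only arise when the builder call fails, and that case is surfaced loudly and disables execution rather than silently diverging.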