Handle Inconsistent Response in Miner Api #221


Open
wants to merge 19 commits into main

Conversation

0xForerunner
Collaborator

Audit findings:

Not all forwarded requests should be treated equally; specifically, there are eth_ and miner_ requests. If a miner_ request succeeds on the L2-EL but fails on the builder (or vice versa), it can result in a discrepancy in information critical for consensus, such as GasPrice and GasLimit. This can lead to the builder creating a block with a higher gas limit than the L2-EL expects. As a result, the L2-EL will mark that block invalid and potentially flag all future builder blocks as invalid as well, effectively disabling builder functionality.

Recommendation(s): Increase the robustness of the forwarding mechanism. A miner_ request should be forwarded to the builder only if the call to the L2-EL succeeds. If the builder fails to process the request, the failure should be logged. In addition, the system should set the execution_status to Disabled to prevent further issues.
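For illustration, a minimal sketch of the recommended ordering, using hypothetical Client, RpcError, and ExecutionMode stand-ins rather than rollup-boost's actual types: the L2-EL is called first, the builder only sees the miner_ request once the L2-EL has accepted it, and a builder failure is logged and flips execution to Disabled.

use std::sync::{Arc, Mutex};

#[derive(Debug, PartialEq)]
enum ExecutionMode {
    Enabled,
    Disabled,
}

#[derive(Debug)]
struct RpcError(String);

struct Client;

impl Client {
    // Stand-in for forwarding a JSON-RPC request; always succeeds here.
    fn forward(&self, _method: &str) -> Result<String, RpcError> {
        Ok("0x1".to_string())
    }
}

fn forward_miner_request(
    l2: &Client,
    builder: &Client,
    execution_mode: &Arc<Mutex<ExecutionMode>>,
    method: &str,
) -> Result<String, RpcError> {
    // Forward to the L2-EL first; if it fails, return the error to the caller
    // without touching the builder, so the two never diverge silently.
    let l2_res = l2.forward(method)?;

    // Only mirror the call to the builder once the L2-EL has accepted it.
    if let Err(e) = builder.forward(method) {
        // On builder failure, log and disable builder execution so later
        // payloads are not built with inconsistent GasPrice/GasLimit.
        eprintln!("builder rejected {method}: {e:?}; disabling builder execution");
        *execution_mode.lock().unwrap() = ExecutionMode::Disabled;
    }

    Ok(l2_res)
}

fn main() {
    let mode = Arc::new(Mutex::new(ExecutionMode::Enabled));
    // The method name below is only an example value.
    let res = forward_miner_request(&Client, &Client, &mode, "miner_setMaxDASize");
    println!("{res:?}, mode = {:?}", mode.lock().unwrap());
}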

@0xForerunner
Collaborator Author

I've set execution mode to fallback here, although I'm not convinced this is the best way to handle this situation. I think setting the health to partial content is probably sufficient. Let me know what you think!

@0xForerunner 0xForerunner force-pushed the forerunner/inconsistant-miner-api branch from 8a2b679 to 6ea2e32 May 15, 2025 20:25

vercel bot commented May 15, 2025

The latest updates on your projects.

1 Skipped Deployment
Name: rollup-boost · Status: ⬜️ Ignored · Updated (UTC): May 22, 2025 3:12pm

@0xForerunner 0xForerunner changed the base branch from kit/engine-loop to main May 15, 2025 20:26
@0xForerunner
Collaborator Author

Since #201 was closed, I cherry-picked this back onto main.

src/proxy.rs Outdated
builder_client.forward(builder_req, builder_method),
service.l2_client.forward(l2_req, method)
);
if builder_res.is_ok() != l2_res.is_ok() {
@karankurbur karankurbur May 15, 2025

If both builder and l2 are not_ok(), shouldn't we still update execution mode and health?

Collaborator Author

No, because we can return that error to the caller, and they can handle it as they wish.
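For clarity, a small sketch of the decision being discussed, with illustrative names (reconcile and disable_builder are not the PR's API): only a disagreement between the two results is treated as a consistency hazard, while a double failure simply propagates the error to the caller.

// Illustrative only: disable_builder stands in for flipping ExecutionMode,
// logging, and updating the probes.
fn reconcile<T, E>(
    builder_res: Result<T, E>,
    l2_res: Result<T, E>,
    mut disable_builder: impl FnMut(),
) -> Result<T, E> {
    if builder_res.is_ok() != l2_res.is_ok() {
        // Disagreement between builder and L2-EL: disable the builder before
        // answering, since their views of GasPrice/GasLimit may now differ.
        disable_builder();
    }
    // Both-failed (and both-succeeded) cases fall through unchanged: the L2-EL
    // result is returned and the caller decides how to handle an error.
    l2_res
}

fn main() {
    let mut disabled = false;
    let res: Result<u64, &str> = reconcile(Err("boom"), Ok(1), || disabled = true);
    assert_eq!(res, Ok(1));
    assert!(disabled);
}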

src/proxy.rs Outdated
let mut execution_mode = service.execution_mode.lock();
if *execution_mode == ExecutionMode::Enabled {
    *execution_mode = ExecutionMode::Disabled;
    // Drop before acquiring health lock
Collaborator

@0xOsiris 0xOsiris May 16, 2025

The health status here isn't sticky. It will switch back to Healthy on the next health-check interval in the HealthHandle if the builder's unsafe head is up to date. So I'm not sure it's even worth updating the probes unless you can tell the HealthHandle not to update the health until the DA limits discrepancy is resolved.

Although it will likely trigger a conductor failover as is, so it may be worth it.

Collaborator

Ideally the health status here sticks until there is no discrepancy in DA limits between the sequencer and the builder.

Collaborator Author

I think letting the health status revert should be okay, as long as a new leader election is triggered.
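For reference, a hedged sketch of what a sticky probe could look like if that route were taken; Probes, Health, and da_limits_diverged are invented names, not the HealthHandle API.

use std::sync::atomic::{AtomicBool, Ordering};

#[derive(Debug, Clone, Copy, PartialEq)]
enum Health {
    Healthy,
    PartialContent,
}

struct Probes {
    // Latched by the proxy when a miner_ call diverges between builder and L2-EL.
    da_limits_diverged: AtomicBool,
}

impl Probes {
    // Called from the periodic health check: even if the builder's unsafe head
    // is up to date, stay degraded while the latch is set.
    fn evaluate(&self, unsafe_head_up_to_date: bool) -> Health {
        if self.da_limits_diverged.load(Ordering::SeqCst) || !unsafe_head_up_to_date {
            Health::PartialContent
        } else {
            Health::Healthy
        }
    }

    // Cleared once the DA limits agree again.
    fn set_da_limits_diverged(&self, diverged: bool) {
        self.da_limits_diverged.store(diverged, Ordering::SeqCst);
    }
}

fn main() {
    let probes = Probes { da_limits_diverged: AtomicBool::new(false) };
    probes.set_da_limits_diverged(true);
    // The periodic check can no longer flip the status back to Healthy on its own.
    assert_eq!(probes.evaluate(true), Health::PartialContent);
}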

@ferranbt
Collaborator

I will check in more detail next week, but can we add a unit test to check this behaviour?

@0xForerunner
Collaborator Author

@ferranbt Yeah, I will definitely add tests; I was just waiting to settle on the solution. I just pushed a couple of commits, and I think this is currently as good as we can do. I added a retry layer to our HTTP client, which should hopefully prevent us from ever running into this situation.
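The retry layer itself isn't quoted in this thread, so the following is only a rough sketch of the idea under assumed names (a generic retry helper with a fixed backoff); the real client may use a middleware or a different policy.

use std::future::Future;
use std::time::Duration;

// Hypothetical helper: retry a fallible async operation a bounded number of
// times, so a transient failure on one client is less likely to leave the
// builder and the L2-EL with different views of a miner_ call.
async fn retry<T, E, Fut>(
    mut op: impl FnMut() -> Fut,
    attempts: usize,
    backoff: Duration,
) -> Result<T, E>
where
    Fut: Future<Output = Result<T, E>>,
{
    let mut last_err = None;
    for _ in 0..attempts {
        match op().await {
            Ok(v) => return Ok(v),
            Err(e) => {
                last_err = Some(e);
                tokio::time::sleep(backoff).await;
            }
        }
    }
    Err(last_err.expect("attempts must be > 0"))
}

#[tokio::main]
async fn main() {
    let mut calls = 0;
    // Fails once, then succeeds; the helper hides the transient error.
    let res: Result<u32, &str> = retry(
        || {
            calls += 1;
            let attempt = calls;
            async move { if attempt == 1 { Err("transient") } else { Ok(42) } }
        },
        3,
        Duration::from_millis(10),
    )
    .await;
    assert_eq!(res, Ok(42));
}

A bounded attempt count keeps the proxy from hanging forever on a dead backend while still smoothing over transient failures.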

@0xForerunner 0xForerunner requested review from 0xOsiris and ferranbt May 21, 2025 02:48
};

let clone = manager.clone();
tokio::spawn(async move {
Contributor

Consider extracting this future to a separate async function - something like async fn handle_requests
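Something along these lines, where handle_requests is the reviewer's suggested (hypothetical) name and the body is a placeholder:

#[derive(Clone)]
struct Manager;

// The future that was previously written inline in tokio::spawn(async move { .. }),
// extracted so it can be named, documented, and tested on its own.
async fn handle_requests(_manager: Manager) -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
    // ...request-forwarding loop goes here...
    Ok(())
}

#[tokio::main]
async fn main() {
    let manager = Manager;
    // The call site shrinks to spawning a named future.
    let handle = tokio::spawn(handle_requests(manager.clone()));
    let _ = handle.await;
}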

let parts_clone = parts.clone();
let body_clone = body.clone();
let res_sender_clone = res_tx.clone();
tokio::spawn(async move {
Contributor

The result of this future is immediately awaited on, so why are we spawning a task here?

Contributor

I think I see - is it to isolate the inner part in case of a task abort?

Contributor

If that's the case, maybe we should drop a comment explaining it? Or consider an explicit cancellation token.
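If the explicit route were taken, a sketch using tokio_util's CancellationToken might look like this; it is an assumption about the approach, not code from the PR.

use tokio_util::sync::CancellationToken;

// Placeholder for the inner forwarding work.
async fn do_forward() {}

// Instead of relying on task isolation, race the work against an explicit
// cancellation signal so the shutdown intent is visible in the code.
async fn forward_with_cancellation(token: CancellationToken) {
    tokio::select! {
        _ = token.cancelled() => {
            // Cancelled by the caller: finish any bookkeeping and return
            // instead of being aborted mid-request.
        }
        _ = do_forward() => {}
    }
}

#[tokio::main]
async fn main() {
    let token = CancellationToken::new();
    let task = tokio::spawn(forward_with_cancellation(token.clone()));
    token.cancel();
    let _ = task.await;
}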

let mut manager = self.clone();
let parts_clone = parts.clone();
let body_clone = body.clone();
match tokio::spawn(
let has_disabled_execution_mode = Arc::new(AtomicBool::new(false));

let manager = Self {
Contributor

Consider putting the manager inside an Arc and sharing the same instance across many tasks.
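A sketch of that suggestion with an illustrative Manager type: clone the Arc rather than the Manager, so every task observes the same instance.

use std::sync::Arc;
use std::sync::atomic::{AtomicU64, Ordering};

// Illustrative stand-in for the real manager's shared state.
struct Manager {
    requests_handled: AtomicU64,
}

#[tokio::main]
async fn main() {
    let manager = Arc::new(Manager { requests_handled: AtomicU64::new(0) });

    let mut tasks = Vec::new();
    for _ in 0..4 {
        // Cheap pointer clone; all tasks share the one Manager instance.
        let manager = Arc::clone(&manager);
        tasks.push(tokio::spawn(async move {
            manager.requests_handled.fetch_add(1, Ordering::SeqCst);
        }));
    }
    for task in tasks {
        let _ = task.await;
    }
    assert_eq!(manager.requests_handled.load(Ordering::SeqCst), 4);
}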

let mut manager = self.clone();
let parts_clone = parts.clone();
let body_clone = body.clone();
let res_sender_clone = res_tx.clone();
Contributor

unnecessary clone? We could just move res_tx into the task?

@ferranbt ferranbt added the audit label May 22, 2025
Collaborator

Are there meant to be any assertions in this test? It would be nice to have it test the case where the builder fails to respond to the call, but I'm happy to have that as a follow-up.
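As one possible shape for such an assertion; decide_execution_mode below is a stand-in for the PR's real logic, not its API.

#[derive(Debug, PartialEq)]
enum ExecutionMode {
    Enabled,
    Disabled,
}

// Stand-in for the decision the proxy makes after forwarding a miner_ call.
fn decide_execution_mode(builder_ok: bool, l2_ok: bool) -> ExecutionMode {
    if builder_ok != l2_ok {
        ExecutionMode::Disabled
    } else {
        ExecutionMode::Enabled
    }
}

#[test]
fn builder_failure_disables_execution() {
    // The case the reviewer calls out: builder errors while the L2-EL succeeds.
    assert_eq!(decide_execution_mode(false, true), ExecutionMode::Disabled);
    // Agreement (including a double failure) leaves execution enabled.
    assert_eq!(decide_execution_mode(true, true), ExecutionMode::Enabled);
    assert_eq!(decide_execution_mode(false, false), ExecutionMode::Enabled);
}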

req_rx.mark_unchanged();
res_rx.mark_unchanged();

let has_disabled_execution_mode = Arc::new(AtomicBool::new(false));
Collaborator

why is this initially set to false? shouldn't it be derived from the execution mode?

let l2_req = HttpRequest::from_parts(parts, HttpBody::from(body_bytes));
info!(target: "proxy::call", message = "forward request to default execution client", ?method);
service.l2_client.forward(l2_req, method).await
if method == MINER_SET_MAX_DA_SIZE {
Collaborator

It would be useful to add docs explaining why MINER_SET_MAX_DA_SIZE is treated differently here.
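A sketch of the kind of doc comment that could go on the constant; the wording paraphrases the audit finding, and the string value is assumed from the constant's name.

/// Unlike ordinary forwarded calls, this miner_ request must be applied by
/// both the L2-EL and the builder. If only one of the two applies the new DA
/// limits, the builder can produce blocks the L2-EL will reject, so a
/// divergent result disables builder execution instead of being ignored.
const MINER_SET_MAX_DA_SIZE: &str = "miner_setMaxDASize";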

Ok(())
}
Err(e) => {
let msg = format!("failed to send request to l2: {e}");
Collaborator

should self.probes.set_health be set to unhealthy here?

drop(execution_mode);
self.has_disabled_execution_mode
.store(false, std::sync::atomic::Ordering::SeqCst);
info!(target: "proxy::call", message = "setting execution mode to Enabled");
Collaborator

should self.probes.set_health be reset here if the call failed previously?

Err(e) => {
let msg = format!("failed to send request to l2: {e}");
res_tx.send(Some(Err(e)))?;
Err(eyre!(msg))
Collaborator

error logging would be useful here

{
Ok(_) => return Ok(()),
Err(_) => {
tokio::time::sleep(Duration::from_millis(200)).await;
Collaborator

error logging would be useful here
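For instance, a self-contained sketch of the retry arm with the logging added; send_with_retry and send_once are stand-ins, and the warn! call assumes the same tracing macros used by the info!(target: "proxy::call", ...) lines elsewhere in this file.

use std::time::Duration;
use tracing::warn;

// Stand-in for the fallible send wrapped by the loop above; illustrative only.
async fn send_with_retry(
    mut send_once: impl FnMut() -> Result<(), String>,
    max_attempts: usize,
) -> Result<(), String> {
    let mut last_err = String::new();
    for attempt in 1..=max_attempts {
        match send_once() {
            Ok(()) => return Ok(()),
            Err(e) => {
                // The logging the reviewer asks for: record why each retry happens.
                warn!(target: "proxy::call", %e, attempt, "request failed, retrying");
                last_err = e;
                tokio::time::sleep(Duration::from_millis(200)).await;
            }
        }
    }
    Err(last_err)
}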
