High RPC failure rate for eth_sendRawTransaction requests

Incident Report for Base

Postmortem

Summary:

Between 20:00 UTC on 11/15/2024 and 14:00 UTC on 11/16/2024, Base RPC routing services experienced elevated error rates when routing eth_sendRawTransaction requests. At any given time during the incident, 50% to 79% of eth_sendRawTransaction requests failed with `txpool is full` errors (code -32603).

Background:

On 11/15, we shifted traffic to a new txpool cluster. However, a misconfiguration limited the size of its txpool, preventing the cluster from processing all incoming requests. As a secondary problem, our routing systems did not correctly attribute the `txpool is full` errors coming from backend nodes, and thus did not emit the error metrics required to alert us to the situation.
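To illustrate the attribution gap, here is a minimal sketch of how a routing layer could label backend responses so that `txpool is full` errors are counted separately. It is not Base's actual routing code: the function name `classify_response`, the label strings, and the in-memory `error_counts` counter (a stand-in for a real metrics client) are all assumptions. Only the error message and code -32603 come from this report. JSON-RPC errors are commonly returned in the response body rather than as a transport-level failure, so the sketch inspects the body.

```python
import json
from collections import Counter

# Values taken from the incident report.
TXPOOL_FULL_CODE = -32603
TXPOOL_FULL_MESSAGE = "txpool is full"

# Hypothetical stand-in for a metrics client: counts (method, label) pairs.
error_counts = Counter()


def classify_response(method: str, body: str) -> str:
    """Label a JSON-RPC response body and count any error it carries."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        label = "invalid_json"
        error_counts[(method, label)] += 1
        return label

    error = payload.get("error") if isinstance(payload, dict) else None
    if error is None:
        return "ok"

    label = "other_rpc_error"
    if isinstance(error, dict):
        message = str(error.get("message", "")).lower()
        # Match on both code and message, since -32603 is a generic
        # "internal error" code and can carry other messages.
        if error.get("code") == TXPOOL_FULL_CODE and TXPOOL_FULL_MESSAGE in message:
            label = "txpool_full"

    error_counts[(method, label)] += 1
    return label
```

A router could call `classify_response("eth_sendRawTransaction", body)` on each backend reply and export `error_counts` to whatever metrics system it uses.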

Going forward:

We've addressed the problems with the txpool cluster so that it can properly process the incoming volume of requests. Next, we're prioritizing short-term improvements to our monitoring and incident management processes so that issues are caught and communicated publicly as soon as possible.
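As one illustration of the kind of monitoring that can catch a sustained failure rate early, the sketch below tracks the error rate over the most recent requests and flags when it crosses a threshold. It is not a description of Base's monitoring. The class name `ErrorRateWindow`, the window size, and the 50% threshold are hypothetical choices for the example.

```python
from collections import deque


class ErrorRateWindow:
    """Track the error rate over the last `size` requests (hypothetical helper)."""

    def __init__(self, size: int = 100, threshold: float = 0.5) -> None:
        # deque with maxlen drops the oldest outcome when a new one arrives.
        self.outcomes = deque(maxlen=size)
        self.threshold = threshold

    def record(self, is_error: bool) -> None:
        self.outcomes.append(is_error)

    def rate(self) -> float:
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def should_alert(self) -> bool:
        # Only alert once the window is full, to avoid noise from a few samples.
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.rate() >= self.threshold
```

In practice the threshold and window would be tuned against normal traffic, and the check would usually live in an alerting system rather than in application code.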

Posted Nov 22, 2024 - 05:00 UTC

Resolved

Posted Nov 15, 2024 - 20:00 UTC