Description
Recently, I ran into a weird issue in production where our Worker would sometimes just explode, throwing obscure `CONNECT_TIMEOUT` errors when connecting to Postgres, despite having worked for months. Our production environment was down, but with assistance from @WalshyDev and hours of debugging, we found that massively reducing our number of PG connections resolved the problem.
However, take this example worker:
```ts
import postgres from 'postgres';

export default {
  async fetch(request, env, ctx): Promise<Response> {
    const connection = postgres('postgres://reader:[email protected]:5432/pfmegrnargs');

    const results = await Promise.all([
      connection`SELECT 1`,
      connection`SELECT 2`,
      connection`SELECT 3`,
      connection`SELECT 4`,
      connection`SELECT 5`,
      connection`SELECT 6`,
      connection`SELECT 7`,
      connection`SELECT 8`,
      connection`SELECT 9`,
      connection`SELECT 10`,
    ]);

    ctx.waitUntil(connection.end());

    return Response.json(results);
  },
} satisfies ExportedHandler<Env>;
```
Locally, this works without issue. However, if you deploy it to production Workers, it results in an exception being thrown, as can be seen at https://postgres-deadlock-example.jross.workers.dev/, and errors like this in the logs:
I suspect this has something to do with the 6 concurrent connections that Workers allow before subsequent ones are queued: since these DB connections aren't closed until the end of the request, they hang open forever and cause a deadlock. This is a pretty common pattern in production apps that use Hono, Next.js, etc.: you have a function that either returns a db instance, or creates one and then sets it on some reusable storage like `context`, ALS, etc., so you don't constantly spin up and tear down new connections to the DB.
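Reduced to its core, that pattern is a module-level lazy singleton. A hedged sketch (the `Client` type and `createClient` are stand-ins for the real `postgres(...)` call, not code from any actual app):

```ts
// Stand-in for the driver's client/pool type.
type Client = { id: number };

let cached: Client | null = null;
let created = 0;

function createClient(): Client {
  // Stand-in for `postgres(connectionString)` opening a pool.
  created++;
  return { id: created };
}

// Every caller gets the same instance; the client is only created once.
function getDb(): Client {
  if (cached === null) {
    cached = createClient();
  }
  return cached;
}
```

Because every request shares the one instance, the driver's pool settings (notably `max`) are what actually bound the number of open sockets.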
The default `postgres` `max` connections is 10, which is what I've been using up to this point, since the Cloudflare docs don't specify any options in any of their examples, and I believe I've seen folks mention that the defaults are fine. If I drop `max` to 3, with something like `postgres(..., { max: 3 })`, the problem doesn't happen, because the driver only spins up 3 connections.
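If that diagnosis is right, the workaround is a one-line driver option. A minimal sketch, assuming the connection string lives in an `env.DATABASE_URL` binding (a placeholder, not from the example above):

```ts
import postgres from 'postgres';

// Keep the pool under Workers' 6-simultaneous-connection limit so the driver
// never opens more sockets than the runtime will actually service.
const connection = postgres(env.DATABASE_URL, { max: 3 });
```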
I reported this to a few folks, and there are now a couple of changes to the docs and to `postgres` at:
- Add some notes to hyperdrive docs cloudflare-docs#19616
- Default connections to less than 10 in Cloudflare Workers environment to avoid deadlock porsager/postgres#1023
Ideally, deadlocks on TCP connections would be logged in some way that's much easier to debug; chasing obscure timeout errors caused the most confusion here.