SSL SYSCALL error: EOF detected — It Wasn't Postgres – Sylvain Artois | postgresql, tcp, networking, libpq, debugging

If you’ve ever seen SSL SYSCALL error: EOF detected on a long-running Postgres query over a remote connection and blamed SSL, Postgres, or your managed provider — this one’s for you. None of them did it.

I hit this while running search infrastructure for AFK.live on a managed Scaleway PostgreSQL instance. The full story of the search engine is in Full-Text Search on a 2 GB PostgreSQL Instance — this is the networking bug I waved away there as “an SSL issue” and only root-caused later. It turned out to have nothing to do with SSL.

The Symptom

A materialized view feeds the search. To refresh it without locking out queries, you want:

REFRESH MATERIALIZED VIEW CONCURRENTLY headlines_search;

Run against the remote managed instance, it died every time — but only after running for a while:

ERROR:  SSL SYSCALL error: EOF detected

The non-concurrent REFRESH MATERIALIZED VIEW worked fine. So did every short query. Only the long-running concurrent refresh got cut off. That pattern — short queries fine, long queries killed — is the whole tell, and I missed it for months.

The Wrong Theory

“SSL SYSCALL error: EOF detected” reads like a TLS problem. It isn’t. EOF detected means the socket was closed underneath the connection — libpq¹ went to read the next bytes and found the TCP stream already gone. SSL is just the layer that noticed. The error is the messenger.
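To make "EOF" concrete, here is a minimal Python sketch with no TLS and no Postgres involved. It assumes a local socketpair standing in for client and server, and the function name is mine, purely for illustration. It only models an orderly close of the peer; an abrupt reset typically surfaces as an exception rather than an empty read.

```python
import socket


def read_after_peer_close() -> bytes:
    """Return what recv() yields once the other end of the stream is gone."""
    # socketpair() returns two connected stream sockets in one process.
    # They stand in for the client and the server side of a connection.
    client, server = socket.socketpair()
    try:
        server.close()             # the far end of the stream disappears
        return client.recv(1024)   # the surviving end tries to read
    finally:
        client.close()


if __name__ == "__main__":
    # An empty bytes object is how the socket API signals end-of-file.
    print(repr(read_after_peer_close()))
```

The read does not fail with a protocol error; it simply comes back empty. Any layer sitting on top of the socket, TLS included, can only report that the stream ended.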

The real question is: who closed the socket, and why only on long queries?

What’s Actually Happening

REFRESH ... CONCURRENTLY runs for minutes and, crucially, sends no bytes on the socket the entire time. The client has issued the command and is now blocked waiting for a single result. The server is busy rebuilding the view's contents. In both directions the connection is completely silent: no data, no protocol chatter, nothing.

Now look at the path between client and database. My self-hosted box reaches the Scaleway managed instance over the public internet, through:

home NAT (the router’s connection-tracking table), then
a load balancer sitting in front of the managed database.

Both are stateful intermediaries. They keep a table of active TCP flows, and to avoid that table growing forever, they expire idle flows — typically after 5–15 minutes of seeing no packets. When a flow expires, the intermediary drops it from its table and, often, sends an RST. The next time either side speaks, the connection is already dead.
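The eviction behaviour can be sketched as a toy connection-tracking table. This is an illustration only, not how any particular router or load balancer is implemented; the FlowTable class, the flow labels, and the 600-second timeout are all assumptions chosen to match the ten-minute figure used in this post.

```python
from dataclasses import dataclass, field


@dataclass
class FlowTable:
    """Toy connection-tracking table that forgets flows after an idle timeout."""

    idle_timeout: float                            # seconds of silence tolerated
    last_seen: dict = field(default_factory=dict)  # flow id -> time of last packet

    def open(self, flow: str, now: float) -> None:
        """Start tracking a new flow at time `now`."""
        self.last_seen[flow] = now

    def packet(self, flow: str, now: float) -> bool:
        """Pass a packet through; return False if the flow was already evicted."""
        last = self.last_seen.get(flow)
        if last is None or now - last > self.idle_timeout:
            self.last_seen.pop(flow, None)  # the flow is gone from the table
            return False
        self.last_seen[flow] = now          # any packet refreshes the idle timer
        return True


if __name__ == "__main__":
    table = FlowTable(idle_timeout=600)

    # A silent flow: nothing is sent for 12 minutes, then the result arrives.
    table.open("silent", now=0)
    print(table.packet("silent", now=720))   # False

    # A chatty flow: one packet every 30 seconds over the same 12 minutes.
    table.open("chatty", now=0)
    print(all(table.packet("chatty", now=t) for t in range(30, 721, 30)))  # True
```

The only difference between the two flows is whether any packet crossed the device inside the idle window. What the packet contains is irrelevant.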

Here’s the kicker: TCP has a built-in mechanism to prevent exactly this. Keepalives are empty probe packets sent on an idle connection to keep the flow warm, and libpq supports them. But libpq’s keepalives_idle defaults to whatever the OS sets, and the server-side tcp_keepalives_idle defaults to 0, which also means “use the OS default”. On Linux that default is 7200 seconds, or 2 hours. So nothing probes the connection for two hours, while the NAT gives up after ten minutes.

The timeline writes itself:

t=0      client sends REFRESH ... CONCURRENTLY
t=0..n   server works. socket silent in both directions.
t≈10min  NAT/LB idle timeout fires → flow evicted, RST
t=2h     first keepalive would have fired (too late, by ~110 min)
t=n      server finishes, sends result → connection already dead
         → client: SSL SYSCALL error: EOF detected

Short queries finished inside the idle window, so they never tripped it. Long ones got guillotined mid-flight. Textbook idle-timeout teardown — and the database was never at fault.
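The arithmetic behind that timeline fits in a few lines of Python. This is a simplified model, assuming the query is completely silent on the wire and that every keepalive probe gets acknowledged; the function names and the 600-second timeout are illustrative, not measured values.

```python
def outcome(query_s: float, idle_timeout_s: float, keepalive_idle_s: float) -> str:
    """Classify a silent query as 'survives' or 'killed' by an idle timeout."""
    # Without a keepalive firing during the query, the silence lasts as long
    # as the query itself. With keepalives, it is capped at the idle setting.
    longest_silence = min(query_s, keepalive_idle_s)
    return "survives" if longest_silence < idle_timeout_s else "killed"


def keepalive_late_by_minutes(idle_timeout_s: float, keepalive_idle_s: float) -> float:
    """Minutes between the intermediary's timeout and the first keepalive probe."""
    return (keepalive_idle_s - idle_timeout_s) / 60


if __name__ == "__main__":
    NAT_TIMEOUT = 600   # assumed 10-minute idle timeout
    OS_DEFAULT = 7200   # Linux default keepalive idle time

    print(outcome(5, NAT_TIMEOUT, OS_DEFAULT))      # short query: survives
    print(outcome(720, NAT_TIMEOUT, OS_DEFAULT))    # long query: killed
    print(outcome(720, NAT_TIMEOUT, 30))            # long query, 30s keepalives: survives
    print(keepalive_late_by_minutes(NAT_TIMEOUT, OS_DEFAULT))  # 110.0
```

The query itself never changes between the second and third case. Only the longest stretch of silence does.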

The Fix Is Connection Config, Not a Postgres Setting

You don’t fix this in postgresql.conf or by touching the query. You fix it by telling the client to send aggressive TCP keepalives, so a packet flows every 30 seconds and the NAT/LB never sees the flow as idle:

keepalives=1 keepalives_idle=30 keepalives_interval=10 keepalives_count=5

Read that as: enable keepalives, start probing after 30s of idle, repeat every 10s, give up after 5 failed probes. With a probe every 30 seconds, the connection-tracking table stays warm and the flow never gets evicted.
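For a sense of what those four parameters turn into, here is a rough Python sketch that applies the equivalent options to a raw socket. It is not libpq's code. The TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT constants are platform-specific (they exist on Linux), so the sketch skips any the platform lacks, and both helper names are made up for this example.

```python
import socket


def enable_keepalive(sock, idle=30, interval=10, count=5):
    """Turn on TCP keepalives; return the names of the tuning options applied."""
    # keepalives=1: ask the kernel to probe this connection when it goes idle.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    applied = []
    tuning = (
        ("TCP_KEEPIDLE", idle),       # keepalives_idle: silence before first probe
        ("TCP_KEEPINTVL", interval),  # keepalives_interval: gap between unanswered probes
        ("TCP_KEEPCNT", count),       # keepalives_count: unanswered probes before giving up
    )
    for name, value in tuning:
        option = getattr(socket, name, None)  # missing on some platforms
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
            applied.append(name)
    return applied


def worst_case_detection_seconds(idle=30, interval=10, count=5):
    """Rough upper bound on how long a dead peer goes unnoticed."""
    return idle + interval * count


if __name__ == "__main__":
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    print(enable_keepalive(sock))
    print(worst_case_detection_seconds())  # 80
    sock.close()
```

A side benefit of these values: if the path really does die, the client finds out in roughly 80 seconds instead of hanging until something else times out.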

Set it everywhere a long-lived or remote connection exists — not just the one place you saw the error. In my stack that meant three spots:

psql in the deploy script — as a conninfo string:

psql "host=$DB_HOST port=$DB_PORT user=$DB_USER dbname=$DB_NAME \
      keepalives=1 keepalives_idle=30 keepalives_interval=10 keepalives_count=5" \
  -c "REFRESH MATERIALIZED VIEW CONCURRENTLY headlines_search;"

The API’s psycopg2 connection — same keywords, as connect args:

import psycopg2

# The keepalive keywords are passed straight through to libpq.
conn = psycopg2.connect(
    host=DB_HOST, port=DB_PORT, user=DB_USER, dbname=DB_NAME,
    keepalives=1, keepalives_idle=30,
    keepalives_interval=10, keepalives_count=5,
)
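Since the same four settings end up in several places, one option is to keep them in a single spot and merge them into whatever connection parameters each caller already has. The sketch below is one way to do that, not something from my actual codebase: KEEPALIVE_PARAMS, with_keepalives, to_conninfo and the hostname are hypothetical names, and the quoting follows libpq's conninfo rules (single quotes around values containing spaces, backslash escapes for quotes and backslashes).

```python
KEEPALIVE_PARAMS = {
    "keepalives": 1,
    "keepalives_idle": 30,
    "keepalives_interval": 10,
    "keepalives_count": 5,
}


def with_keepalives(params: dict) -> dict:
    """Return a copy of params with keepalive settings added (explicit values win)."""
    merged = dict(params)
    for key, value in KEEPALIVE_PARAMS.items():
        merged.setdefault(key, value)
    return merged


def _quote(value) -> str:
    """Quote one conninfo value if it is empty or contains special characters."""
    text = str(value)
    if text and not any(ch.isspace() or ch in "'\\" for ch in text):
        return text
    escaped = text.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def to_conninfo(params: dict) -> str:
    """Render params as a space-separated key=value conninfo string."""
    return " ".join(f"{key}={_quote(value)}" for key, value in params.items())


if __name__ == "__main__":
    # Hypothetical connection details, for illustration only.
    base = {"host": "db.example.internal", "dbname": "app"}
    print(to_conninfo(with_keepalives(base)))
```

The merged dict can be unpacked into psycopg2.connect(**...), and the rendered string can be handed to psql as its conninfo argument, so both callers share one definition of "aggressive keepalives".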

The SSR connection pool (node-postgres) — different API, same idea:

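const { Pool } = require("pg");

const pool = new Pool({
  host: DB_HOST,
  // ...
  keepAlive: true,                     // enable TCP keepalives on each pooled socket
  keepAliveInitialDelayMillis: 30000,  // first probe after 30s of idle
});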

With keepalives on, REFRESH MATERIALIZED VIEW CONCURRENTLY ran for 2m23s over the remote link without a single EOF. Problem gone.

The Lesson

SSL SYSCALL error: EOF detected is one of the most misleading errors in the Postgres world. It names SSL, but SSL is just the layer that discovered a corpse. The actual culprit is almost always a stateful network device killing an idle TCP flow — a NAT, a load balancer, a firewall, a cloud gateway — combined with keepalive defaults measured in hours.

The diagnostic shortcut: if short queries succeed and long, quiet ones fail, you’re not looking at a database bug. You’re looking at an idle connection being reaped between you and the server. Reach for keepalives before you reach for the Postgres docs.

This is a footnote to Full-Text Search on a 2 GB PostgreSQL Instance, part of a series about building AFK.live, a news aggregation platform. I’m a senior engineer learning ML and data engineering, documenting the process — including the bugs.

¹ libpq is the official C client library for PostgreSQL: the layer that actually opens the TCP socket, negotiates SSL, and speaks the wire protocol to the server. Many clients sit on top of it (psycopg2 for Python, the psql CLI itself), which is why the same keepalives_* connection parameters work across all of them. The keepalive and SSL behaviour described here lives in libpq, not in Postgres the server. See the official connection-parameters documentation.
