Full-Text Search on a 2 GB PostgreSQL Instance
by Sylvain Artois on Apr 10, 2026
- #postgresql
- #pg_trgm
- #astro
- #search
- #performance
AFK.live aggregates headlines from 250+ news sources. I needed search. I had a 2 GB database and no budget for Elasticsearch.
Fair warning: this article goes deep into PostgreSQL internals — query plans, GIN indexes, trigram scoring. I’ve added links throughout for context. If you’re not familiar with these concepts, the links are there for you. If you are, skip them and enjoy the war stories.
The Constraint
AFK runs on a Scaleway managed PostgreSQL instance — the smallest tier they offer:
- DB-PLAY2-PICO: 1 vCPU, 2 GB RAM
- Cost: roughly 70% of the entire AFK hosting budget (50€)
No PGVector. No Elasticsearch. No Typesense. Just PostgreSQL 16 with the pg_trgm and unaccent extensions, which Scaleway includes by default on managed instances.
The headlines table has grown past 1 million rows. I needed fuzzy search that handles French accents, returns results in under 300ms, and doesn’t bring the tiny instance to its knees.
Here’s how I got there — and the bugs I hit along the way.
Building the Foundation: A Materialized View
Searching directly against the headlines table would mean joining with sources on every query, and computing lower(unaccent(title)) at query time for every candidate row. On a 1 vCPU machine, that’s not going to work.
The solution is a materialized view that pre-computes the normalized search field:
CREATE MATERIALIZED VIEW IF NOT EXISTS headlines_search AS
SELECT
h.id,
h.title,
h.url,
h.published_at,
h.source_slug,
lower(unaccent(h.title)) as search_title
FROM headlines h
JOIN sources s ON h.source_slug = s.slug
WHERE
s.exclude_from_search = false
AND h.published_at IS NOT NULL;
search_title is pre-lowercased and unaccented at materialization time, not at query time. The view also filters out sources I’ve flagged as excluded (far-right sources).
On top of this, three indexes:
-- GIN trigram index for fuzzy text matching
CREATE INDEX idx_headlines_search_title_gin
ON headlines_search USING gin(search_title gin_trgm_ops);
-- For chronological ordering and date range filters
CREATE INDEX idx_headlines_search_published_at
ON headlines_search(published_at DESC);
-- For source filtering
CREATE INDEX idx_headlines_search_source_slug
ON headlines_search(source_slug);
The GIN trigram index is the core of the search: it breaks every search_title into 3-character grams and indexes them, enabling fuzzy matching without full table scans.
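To make "trigram" concrete, here is a small sketch (plain JavaScript, not the extension's actual code) of how pg_trgm decomposes a word: it is lowercased, padded with two leading spaces and one trailing space, then sliced into every 3-character window.

```javascript
// Sketch of pg_trgm-style trigram extraction (not the extension's real code):
// pad each word with two leading spaces and one trailing space, then take
// every 3-character slice.
function wordTrigrams(word) {
  const padded = `  ${word.toLowerCase()} `;
  const grams = new Set();
  for (let i = 0; i + 3 <= padded.length; i++) {
    grams.add(padded.slice(i, i + 3));
  }
  return grams;
}

// "macron" yields 7 trigrams: "  m", " ma", "mac", "acr", "cro", "ron", "on "
console.log([...wordTrigrams('macron')].sort());
```

The padding matters: it gives extra weight to word beginnings, which is why prefix matches score well.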
Refreshing the View
A materialized view is a snapshot. New headlines are invisible until you refresh it. AFK’s CI/CD pipeline deploys every two hours, so I added the refresh as the first step of the deployment script:
# deploy.legacy.sh — runs before Astro build
env PGPASSWORD="${SCALEWAY_DB_PASSWORD}" psql \
-h "${SCALEWAY_DB_HOST}" \
-p "${SCALEWAY_DB_PORT}" \
-U "${SCALEWAY_DB_USER}" \
-d "${SCALEWAY_DB_NAME}" \
-c "REFRESH MATERIALIZED VIEW headlines_search;" \
&& echo "headlines_search refreshed successfully" \
|| echo "headlines_search refresh failed, continuing with stale data"
A note on REFRESH MATERIALIZED VIEW CONCURRENTLY: it exists, and I have the unique index it requires. But over the remote connection to Scaleway’s managed instance, the concurrent refresh kept hitting SSL EOF errors on longer operations. The non-concurrent variant works reliably — but it locks the view for the duration of the refresh. During that lock, any search query hitting the view will fail. We deploy during business hours, so this is a real trade-off: a few seconds of downtime every two hours. On a site with a few concurrent users, that’s acceptable. At larger scale, you’d want to schedule refreshes overnight and accept 24 hours of staleness — or fix the SSL issue and use CONCURRENTLY.
Bug #1: The View That Was Never Refreshed
The search page went live. It worked. Then a user pointed out that no results appeared after November 2025.
The materialized view had never been refreshed since its creation. It had 357K rows. The headlines table had over 1 million. Two-thirds of the articles were invisible to search.
The frustrating part: I had planned for this. The git history tells the story. The initial commit (October 2025) didn’t just create the view — it also created a refresh_headlines_search() PL/pgSQL function and configured two Ofelia scheduler jobs in scheduler-config.ini: one for weekdays (refreshing nine times a day, from 6:15 to 22:15 CET) and one for Sundays (five times a day). The infrastructure was designed correctly.
But the scheduler jobs referenced environment variables (${SCALEWAY_DB_ADMIN_PASSWORD}) that were never injected into the container. The jobs ran on schedule, failed silently, and nobody noticed for five months. The refresh_headlines_search() function I carefully wrote in the migration? Never called once.
Twenty years of engineering experience, and I forgot to check that my cron jobs actually worked. The fix was the deploy script change above — crude but reliable, because it runs in the same environment that already has the database credentials.
The lesson: materialized views are a pit of success when they work, and a silent disaster when they go stale. There’s no built-in warning. If you use them, monitor the row count or the most recent published_at — don’t assume your refresh mechanism works just because it exists.
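One cheap way to do that is sketched below with a hypothetical helper (the counts themselves would come from a `SELECT count(*)` on the view and on the base table): flag the view as stale once it drifts more than a few percent behind.

```javascript
// Hypothetical staleness check: compare the materialized view's row count
// against the base table's and flag drift beyond a tolerance.
function viewLooksStale(viewRows, tableRows, tolerance = 0.05) {
  if (tableRows === 0) return false;
  return (tableRows - viewRows) / tableRows > tolerance;
}

// The numbers from Bug #1: 357K rows in the view, 1M+ in the table.
console.log(viewLooksStale(357_000, 1_000_000)); // → true
```

Wire a check like this into an existing health endpoint or cron alert; the point is simply that someone gets paged before a user notices five months of missing headlines.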
Bug #2: The 4.4-Second Query
After the refresh brought the view to ~966K rows, the next problem appeared immediately. Searching “donald trump” took 4.4 seconds and triggered an HTTP 504 timeout. (The short timeout is intentional — I’d rather show an error than let a slow query go unnoticed. If search is too slow, I want to know about it before users start complaining.)
Here’s what the original query looked like:
SELECT hs.id, hs.title, hs.url, hs.published_at,
s.name, s.logo, s.url, s.slug,
similarity(hs.search_title, lower(unaccent($1))) as score
FROM headlines_search hs
JOIN sources s ON hs.source_slug = s.slug
WHERE hs.search_title % lower(unaccent($1))
ORDER BY hs.published_at DESC, score DESC
LIMIT 50;
The % operator is pg_trgm’s similarity operator. It uses the GIN index. Sounds efficient. Let’s look at EXPLAIN ANALYZE:
Bitmap Index Scan on idx_headlines_search_title_gin
Index Cond: (search_title % lower(unaccent('donald trump')))
rows=55508
Bitmap Heap Scan on headlines_search hs
Rows Removed by Index Recheck: 55351
Heap Blocks: exact=32113
Execution Time: 4443ms
The GIN index returned 55,508 candidates. After the recheck filter, only 157 survived — 99.7% noise. PostgreSQL had to load 32,113 heap blocks (a heap block is an 8 KB page of actual table data — 32K blocks means ~250 MB of disk reads) to evaluate the similarity predicate on each candidate. On a 2 GB instance, that’s a significant chunk of available memory spent on I/O.
The problem is how GIN trigram indexes work: they find all rows that share any trigrams with the query. For a common term like “donald trump”, that’s a lot of rows.
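The back-of-the-envelope math below (a sketch of the trigram arithmetic, not pg_trgm itself) shows the mechanism: a title that merely contains "donald" shares every trigram of that word with the query, which is enough to become an index candidate, yet its overall similarity stays below the 0.3 threshold and dies in the recheck.

```javascript
// Sketch of pg_trgm's trigram math (not the extension's actual code).
function trigrams(text) {
  const grams = new Set();
  for (const word of text.toLowerCase().split(/\s+/).filter(Boolean)) {
    const padded = `  ${word} `;
    for (let i = 0; i + 3 <= padded.length; i++) grams.add(padded.slice(i, i + 3));
  }
  return grams;
}

// similarity(a, b) = shared trigrams / total distinct trigrams
function similarity(a, b) {
  const ga = trigrams(a);
  const gb = trigrams(b);
  const shared = [...ga].filter((t) => gb.has(t)).length;
  return shared / (ga.size + gb.size - shared);
}

const query = 'donald trump';
const title = 'donald tusk felicite le nouveau chancelier allemand';
// The title shares all 7 trigrams of "donald" with the query: index candidate.
// But the long title inflates the union, so the final score collapses.
console.log(similarity(query, title) >= 0.3); // → false: killed by the recheck
```

Multiply that pattern by every "Donald" headline in a million-row view and you get 55K candidates feeding 32K heap-block reads.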
The Fix: strict_word_similarity
PostgreSQL’s pg_trgm extension offers three similarity functions, and they behave very differently:
| Function | What it measures | Operator |
|---|---|---|
| similarity(a, b) | How similar the full strings are | % |
| word_similarity(a, b) | Whether a matches any word in b | <% |
| strict_word_similarity(a, b) | Whether a matches a contiguous word sequence in b | <<% |
similarity compares the entire query against the entire title — terrible for short queries against long titles. word_similarity is more lenient but still too broad. strict_word_similarity requires the query to match a contiguous sequence of words in the title. That’s exactly what you want when someone searches for a proper noun like “donald trump” or “réchauffement climatique”.
The <<% operator also changed the query plan. Instead of the GIN bitmap scan, PostgreSQL used an incremental sort on the published_at index — scanning from most recent to oldest and stopping once it had enough results:
SELECT hs.id, hs.title, hs.url, hs.published_at,
s.name, s.logo, s.url, s.slug,
strict_word_similarity(lower(unaccent($1)), hs.search_title) as score
FROM headlines_search hs
JOIN sources s ON hs.source_slug = s.slug
WHERE lower(unaccent($1)) <<% hs.search_title
ORDER BY hs.published_at DESC, score DESC
LIMIT 50;
The results:
| Version | Execution time | Index used |
|---|---|---|
| Original (% + similarity) | 4,444ms | GIN (55K candidates) |
| Raised threshold to 0.3 | 836ms | GIN (55K candidates) |
| word_similarity <% | 765ms | published_at |
| strict_word_similarity <<% | 221ms | published_at |
From 4.4 seconds to 221ms. The 2 GB instance can handle that.
The trade-off is real, though: strict_word_similarity matches contiguous word sequences, not approximate ones. It won’t forgive typos the way similarity does — “donlad trump” returns nothing. You’re trading fuzzy tolerance for speed and precision. For a news search where users type proper nouns they already know how to spell, that’s the right call. For a general-purpose search box, it might not be.
The Relevance Problem
Fast isn’t useful if the results are wrong. A user reported that searching “emmanuel macron” returned headlines about Emmanuel Bellanger, Emmanuel Grégoire, and Emmanuel-Philibert de Savoie mixed in with actual Macron results.
I tested with synthetic data to understand the scoring:
| Title | strict_word_similarity score | Matches? |
|---|---|---|
| “emmanuel macron annonce une réforme” | 1.0 | yes |
| “emmanuel bellanger critique le gouvernement” | 0.5625 | yes |
| “emmanuel grégoire quitte la mairie de paris” | 0.5625 | yes |
| “macron reçoit biden à l’élysée” | 0.4375 | no |
The issue: “emmanuel” alone contributes 9 of the query’s 16 trigrams, so any title containing just that word scores 9/16 = 0.5625 — well above the default pg_trgm.strict_word_similarity_threshold of 0.3. Any “Emmanuel X” headline passes the filter.
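To verify the arithmetic, here is a simplified reimplementation of strict_word_similarity: take the best trigram similarity between the query and any contiguous run of words in the title. This is not pg_trgm's actual algorithm, just a sketch, but it reproduces the scores in the table above.

```javascript
// Simplified strict_word_similarity: best trigram similarity between the query
// and any contiguous run of words in the title. Not pg_trgm's real
// implementation, but it reproduces the scores in the table above.
function trigrams(text) {
  const grams = new Set();
  for (const word of text.toLowerCase().split(/\s+/).filter(Boolean)) {
    const padded = `  ${word} `;
    for (let i = 0; i + 3 <= padded.length; i++) grams.add(padded.slice(i, i + 3));
  }
  return grams;
}

function similarity(ga, gb) {
  const shared = [...ga].filter((t) => gb.has(t)).length;
  return shared / (ga.size + gb.size - shared);
}

function strictWordSimilarity(query, title) {
  const gq = trigrams(query);
  const words = title.toLowerCase().split(/\s+/).filter(Boolean);
  let best = 0;
  // Try every contiguous word span of the title and keep the best score.
  for (let i = 0; i < words.length; i++) {
    for (let j = i + 1; j <= words.length; j++) {
      best = Math.max(best, similarity(gq, trigrams(words.slice(i, j).join(' '))));
    }
  }
  return best;
}

// Titles pre-normalized (lowercased, unaccented) as in the search_title column:
console.log(strictWordSimilarity('emmanuel macron', 'emmanuel macron annonce une reforme'));          // → 1
console.log(strictWordSimilarity('emmanuel macron', 'emmanuel bellanger critique le gouvernement'));  // → 0.5625
console.log(strictWordSimilarity('emmanuel macron', 'macron recoit biden a l elysee'));               // → 0.4375
```

The 0.5625 case is the span "emmanuel" alone: its 9 trigrams are all in the query's 16, and 9/16 = 0.5625.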
On the real dataset, the score distribution confirmed it:
| Score | Count | What they are |
|---|---|---|
| 1.00 | 4,246 | Actual “Emmanuel Macron” headlines |
| 0.56 | 977 | “Emmanuel [someone else]” noise |
19% noise. And the problem is specific to queries containing common first names — “réchauffement climatique” worked perfectly because both words are long and distinctive.
Threshold Tuning: SET vs WHERE
Two ways to raise the bar:
Option A — SET the threshold before the query:
SET pg_trgm.strict_word_similarity_threshold = 0.8;
SELECT ... WHERE lower(unaccent($1)) <<% hs.search_title ...
Option B — add a WHERE clause:
SELECT ... WHERE lower(unaccent($1)) <<% hs.search_title
AND strict_word_similarity(...) >= 0.8 ...
They produce the same results but perform very differently. The SET approach tells the GIN index to use 0.8 as its scan threshold — it returns 4,295 candidates. The WHERE approach uses the default 0.3 for the index scan (8,460 candidates), then discards rows in the heap.
-- SET approach: 166ms, 4,295 index candidates
-- WHERE approach: 348ms, 8,460 index candidates
The SET is 2x faster because it pushes the selectivity into the index. The catch is that SET is a session-level command — it changes the threshold for the entire database connection, not just one query. If you’re using a connection pool (and you should be), that modified connection goes back into the pool after your request. The next request that picks it up inherits the 0.8 threshold, which might not be what it expects.
How we handle this is covered in the next section.
The API Endpoint
Everything comes together in a single Astro SSR endpoint at /api/search/headlines. Here’s the core flow:
// Get a dedicated client from the pool
const client = await databaseService.getClient();
try {
// Set threshold only when there's a text query
if (hasQuery) {
await databaseService.executeQueryWithClient(
client,
"SET pg_trgm.strict_word_similarity_threshold = 0.8",
);
}
// Execute search and count queries on the same client
results = await databaseService.executeQueryWithClient(client, query, params);
const countResult = await databaseService.executeQueryWithClient(
client,
countQuery,
countParams,
);
totalCount = parseInt(countResult[0]?.total_count || "0", 10);
} finally {
client.release();
}
The SET command only fires when there’s a text query. If the user is just browsing by source or date — no text, no threshold change.
The finally block is where we handle the session-level SET problem. When client.release() returns the connection to the pool, it carries the 0.8 threshold with it. In our case, this is fine: the search endpoint is the only consumer of strict_word_similarity, so every text search sets the threshold before querying. No other code path depends on the default 0.3. If it did, we’d need to SET pg_trgm.strict_word_similarity_threshold = DEFAULT before releasing — or use SET LOCAL inside a transaction, which scopes the setting to that transaction only.
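If another code path did depend on the default, the SET LOCAL variant would look roughly like this (a node-postgres-style sketch; the function name and client shape are assumptions, not AFK's actual code):

```javascript
// Sketch: scope the threshold to a single transaction with SET LOCAL, so the
// connection returns to the pool with the default threshold intact.
// (node-postgres style; names here are assumptions, not AFK's real code.)
async function searchWithLocalThreshold(client, searchQuery, params) {
  await client.query('BEGIN');
  try {
    // SET LOCAL reverts automatically at COMMIT or ROLLBACK
    await client.query('SET LOCAL pg_trgm.strict_word_similarity_threshold = 0.8');
    const results = await client.query(searchQuery, params);
    await client.query('COMMIT');
    return results;
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  }
}
```

The cost is an explicit transaction around a read-only query, which is why a single-consumer endpoint can reasonably skip it.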
The query is built dynamically based on which filters are present:
// Text search: strict_word_similarity with <<% operator
if (hasQuery) {
  conditions.push(`lower(unaccent($${paramCounter++})) <<% hs.search_title`);
}
// Optional source, date_from, date_to filters (each branch also pushes its
// bound value into params)
if (source_slug) {
  conditions.push(`hs.source_slug = $${paramCounter++}`);
}
The scoring column adapts too: strict_word_similarity(...) when there’s a text query, 0 otherwise. Results are ordered by published_at DESC, score DESC — most recent first, highest relevance as tiebreaker.
Pagination uses LIMIT/OFFSET with a separate COUNT(*) query for the total. Responses get a 5-minute Cache-Control header since the view only changes at deploy time anyway.
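The page-to-offset arithmetic is simple but worth pinning down. A hypothetical helper (the endpoint itself takes limit and offset directly):

```javascript
// Hypothetical helper: translate a 1-based page number into LIMIT/OFFSET.
// Mirrors the frontend's offset = (currentPage - 1) * pageSize computation.
function pageToLimitOffset(page, pageSize) {
  return { limit: pageSize, offset: (page - 1) * pageSize };
}

console.log(pageToLimitOffset(3, 50)); // → { limit: 50, offset: 100 }
```

LIMIT/OFFSET degrades on deep pages because PostgreSQL still scans the skipped rows, but with a 50-row page size and users who rarely go past page 3, that is a non-issue here.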
The Frontend
The search page is an Astro SSR page with an Alpine.js component handling all client-side interaction. No framework overhead — Alpine weighs ~15 KB and feels right for a form with a few filters.
The component manages four concerns:
Source autocomplete: Sources are loaded from an Astro content collection into a JSON data island at build time. The autocomplete filters client-side — no API call needed. Selecting a source shows it as a chip with the outlet’s favicon.
Date range filters: Two date inputs that map directly to date_from and date_to query parameters. Simple, but together with the source filter they allow powerful browsing: “show me everything from Le Monde in March 2026”.
URL state sync: Every search updates the browser URL via history.replaceState() — ?q=macron&source_slug=lemonde&page=2. Bookmarkable, shareable. On page load, init() reads the URL params back and re-executes the search.
Pagination: Page state is managed in Alpine and synced to the URL. The API returns total_count, Alpine computes totalPages, and the UI shows Previous/Next controls. Page changes trigger a new API call and smooth-scroll back to the top.
async executeSearch() {
const params = new URLSearchParams();
if (this.query) params.set('q', this.query);
params.set('limit', String(this.pageSize));
params.set('offset', String((this.currentPage - 1) * this.pageSize));
if (this.selectedSource) params.set('source_slug', this.selectedSource.slug);
if (this.dateFrom) params.set('date_from', this.dateFrom);
if (this.dateTo) params.set('date_to', this.dateTo);
// Update browser URL without reload
history.replaceState(null, '', `?${params.toString()}`);
const response = await fetch(`/api/search/headlines/?${params.toString()}`);
// ...
}
No debounce, no search-as-you-type. The user submits the form, sees results. Fast enough that it feels instant.
What a 2 GB Database Can Do
The final stack:
- Materialized view with pre-computed lower(unaccent(title)) — avoids runtime text normalization on 966K rows
- GIN trigram index — handles fuzzy matching with French accents
- strict_word_similarity with <<% operator — 20x faster than similarity for multi-word queries
- Session-level threshold at 0.8 — pushes selectivity into the index, eliminates 19% noise on name queries
- Astro SSR + Alpine.js — server-side API, lightweight client
- 5-minute cache — the view only changes at deploy time anyway
Average query time: ~200ms for text search, faster for filter-only browsing. On the smallest managed PostgreSQL instance Scaleway offers.
The lesson is boring but worth repeating: before reaching for a search service, check what your existing database can do. PostgreSQL’s pg_trgm isn’t Elasticsearch — there’s no stemming, no typo tolerance beyond trigram overlap, no field boosting. But for searching a million news headlines in a single language, it’s more than enough.
The real optimization work wasn’t about choosing the right tool. It was about reading EXPLAIN ANALYZE output, understanding how GIN indexes actually scan, and tuning a single threshold value.
This article is part of a series about building AFK.live, a news aggregation platform. I’m a senior engineer learning ML and data engineering, documenting the process — including the bugs.