by Sylvain Artois on Aug 27, 2025
I built myself a little hyperparameter tuning setup for BERTopic to evaluate the best combinations of UMAP/HDBSCAN settings and a few BERTopic parameters.
The script spins up a Python subprocess for each possible combination of a grid like this:
{
  "umap_n_neighbors": [12, 14, 16],
  "umap_n_components": [7, 8, 9, 10, 11],
  "umap_min_dist": [0.1],
  "hdbscan_min_cluster_size": [3],
  "hdbscan_min_samples": [8, 9, 10],
  "hdbscan_cluster_selection_epsilon": [0.3, 0.5],
  "bertopic_top_n_words": [12],
  "nr_topics": ["none"],
  "n_gram_max": [2, 3],
  "embedding_model": ["dangvantuan/sentence-camembert-large"],
  "remove_stopwords": [false]
}
To evaluate the clustering, I need a hand-labeled ground truth to compare each tested parameter combination against.
But hand-labeling is tedious. I have 9 headline categories (from SetFit), which potentially means 9 BERTopic configs, and therefore 9 ground-truth datasets… Not easy.
I wondered if Claude Sonnet could do the clustering for me.
I feed it raw data: a SQL query with proportional sampling by source to keep it somewhat representative, the kind of horror that would have been unthinkable before having a DBA bot permanently on tap:
WITH distinct_headlines AS (
  SELECT DISTINCT h.id, h.title, h.source_slug
  FROM headlines h
  JOIN headlines_batches hb ON h.id = hb.headline_id
  JOIN parsing_runs pr ON hb.batch_id = pr.batch_id
  JOIN sources s ON h.source_slug = s.slug
),
ranked_headlines AS (
  SELECT id, title, source_slug,
    ROW_NUMBER() OVER (PARTITION BY source_slug ORDER BY RANDOM()) AS rn,
    CEIL(COUNT(*) OVER (PARTITION BY source_slug) * 0.66) AS max_per_source
  FROM distinct_headlines
)
SELECT json_agg(
  json_build_object(
    'id', id,
    'title', title,
    'topic_id', null
  )
) AS headlines
FROM ranked_headlines
WHERE rn <= max_per_source
Here’s the prompt I use:
I need to cluster this JSON file containing 350 headlines from the French or Belgian press, categorized by Setfit as {{setfit_category}}.
My goal is to establish a ground truth to evaluate an unsupervised NLP clustering solution (BERTopic).
For the output, I need the same file back, with each "topic_id": null replaced by an integer id corresponding to the cluster you've identified.
Example:
{
"id": 319849,
"title": "{{ example }}",
"topic_id": 2
},
The goal is to obtain very coherent clusters.
Some guidelines:
- don't create overly broad mega-clusters
- don't create a cluster unless at least 3 sources mention the same topic
- try to identify entities: people, places, events, and use them to group headlines
- I prefer many outliers with precise clusters over few outliers with fuzzy clusters.
It's entirely possible that a precise, meticulous clustering yields 10 clusters with over 50% outliers.
For headlines you can't group, indicate "topic_id": -1 (outlier).
I compared my actual manual labeling with Claude's output, and the results are close, if not better: totally satisfactory.
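For the comparison itself, scikit-learn's clustering metrics are the natural tools; this is my sketch rather than the post's exact code, and it assumes both labelings are lists of {"id", "topic_id"} dicts with the -1 outliers kept as a label of their own:

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def compare_labelings(manual, predicted):
    """Score a predicted labeling against the manual ground truth.

    Both metrics are invariant to cluster renumbering, so it doesn't
    matter that Claude's cluster 2 might be my cluster 5.
    """
    pred_by_id = {h["id"]: h["topic_id"] for h in predicted}
    y_true = [h["topic_id"] for h in manual]
    y_pred = [pred_by_id[h["id"]] for h in manual]
    return {
        "ari": adjusted_rand_score(y_true, y_pred),
        "nmi": normalized_mutual_info_score(y_true, y_pred),
    }
```

The same function works for scoring each BERTopic parameter combination against the ground truth, which is the whole point of the setup.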
A few observations:
Size matters: beyond 350 items, Claude Sonnet (via the app) hits the token limit. At 350 items, I'd say it works roughly 80% of the time. But I always get a result anyway; you just need to finish off Claude's incomplete work. Since CSV is much less verbose than JSON, the limit might be higher with that format.
There you have it, a nice simple solution. Come September my data will probably evolve, and I’ll be able to refine my clustering almost automatically, just with an Anthropic subscription.