Using Claude Sonnet for NLP Clustering

by Sylvain Artois on Aug 27, 2025

Fantômas, 1915 - Juan Gris - www.nga.gov

I built myself a little hyperparameter tuning setup for BERTopic to evaluate the best possible combinations of UMAP / HDBSCAN settings and a few BERTopic parameters.

The script spins up a Python subprocess for each possible combination, this kind of thing:

{
  "umap_n_neighbors": [12, 14, 16],
  "umap_n_components": [7, 8, 9, 10, 11],
  "umap_min_dist": [0.1],
  "hdbscan_min_cluster_size": [3],
  "hdbscan_min_samples": [8, 9, 10],
  "hdbscan_cluster_selection_epsilon": [0.3, 0.5],
  "bertopic_top_n_words": [12],
  "nr_topics": ["none"],
  "n_gram_max": [2, 3],
  "embedding_model": ["dangvantuan/sentence-camembert-large"],
  "remove_stopwords": [false]
}
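
For reference, a minimal sketch of what that driver loop could look like, assuming the grid above is saved as grid.json and a hypothetical run_bertopic.py script that accepts each parameter as a CLI flag:

import itertools
import json
import subprocess

with open("grid.json") as f:
    grid = json.load(f)

keys = list(grid)
for values in itertools.product(*(grid[k] for k in keys)):
    params = dict(zip(keys, values))
    # one Python subprocess per parameter combination
    # (run_bertopic.py and its flag names are illustrative)
    args = ["python", "run_bertopic.py"]
    for name, value in params.items():
        args += [f"--{name}", str(value)]
    subprocess.run(args, check=True)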

To evaluate the clustering, I need a hand-labeled ground truth to compare each tested parameter combination against.

But it’s tedious. I have 9 headline categories (setfit), which potentially means 9 BERTopic configs, and therefore 9 datasets… Not easy.

I wondered if Claude Sonnet could do the clustering for me.

I feed it raw data, a SQL query with proportional sampling by source to be somewhat representative, this kind of horror that would’ve been unthinkable before having a DBA bot permanently on tap:

WITH distinct_headlines AS (
  SELECT DISTINCT h.id, h.title, h.source_slug
  FROM headlines h
  JOIN headlines_batches hb ON h.id = hb.headline_id
  JOIN parsing_runs pr ON hb.batch_id = pr.batch_id
  JOIN sources s ON h.source_slug = s.slug
),
ranked_headlines AS (
  SELECT id, title, source_slug,
  ROW_NUMBER() OVER (PARTITION BY source_slug ORDER BY RANDOM()) as rn,
  CEIL(COUNT(*) OVER (PARTITION BY source_slug) * 0.66) as max_per_source
  FROM distinct_headlines
)
SELECT json_agg(
  json_build_object(
    'id', id,
    'title', title,
    'topic_id', null
  )
) AS headlines
FROM ranked_headlines
WHERE rn <= max_per_source
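
To turn the query's output into a file Claude can ingest, a few lines of Python are enough. A sketch assuming a PostgreSQL database and psycopg2 (the connection string and file names are placeholders):

import json
import psycopg2

# the query above, saved to a .sql file
sql = open("sample_headlines.sql").read()

conn = psycopg2.connect("dbname=headlines")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute(sql)
    # json_agg returns a single row with a single column,
    # which psycopg2 deserializes into a list of dicts
    headlines = cur.fetchone()[0]

with open("headlines_to_label.json", "w") as f:
    json.dump(headlines, f, ensure_ascii=False, indent=2)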

Here’s the prompt I use:

I need to cluster this JSON file containing 350 headlines from the French or Belgian press, categorized by Setfit as {{setfit_category}}.

My goal is to establish a ground truth to evaluate an unsupervised NLP clustering solution (BERTopic).

For the output, I need the same file; just replace "topic_id": null with an id (an int) corresponding to the cluster you've identified.

Example:
{
  "id": 319849,
  "title": "{{ example }}",
  "topic_id": 2
},

The goal is to obtain very coherent clusters.

Some guidelines:
- don't create overly broad mega-clusters
- don't create a cluster unless at least 3 sources mention the same topic
- try to identify entities: people, places, events, and use them to group headlines
- I prefer maximizing outliers with precise clusters, rather than the opposite: few outliers but fuzzy clusters.

It's entirely possible that a precise and meticulous clustering yields 10 clusters with over 50% outliers.
For headlines you can't group, indicate "topic_id": -1 (outlier).

I compared my own manual labeling with Claude's output, and the results are close, if not better: totally satisfactory.
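
If you want a number rather than a gut feeling, the agreement between the two labelings can be scored with a standard clustering metric. A minimal sketch with scikit-learn, assuming two lists of topic_id values in the same headline order (the values below are toy data):

from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# one topic_id per headline, same order in both lists; -1 marks outliers
manual = [0, 0, 1, -1, 2, 2]
claude = [3, 3, 0, -1, 1, 1]

print(adjusted_rand_score(manual, claude))           # 1.0 means identical partitions
print(normalized_mutual_info_score(manual, claude))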

A few observations:

Size matters: beyond 350 items, Claude Sonnet, via the app, hits the token limit. At 350 items, I'd say it works about 80% of the time. But I always get a result anyway; you just need to refine Claude's unfinished work. Since CSV is much less verbose than JSON, the limit might be higher with that format.
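
To spot that unfinished work quickly, a small check is enough. A sketch assuming the file I sent and the one Claude returned (file names are illustrative):

import json

with open("headlines_to_label.json") as f:
    sent_ids = {h["id"] for h in json.load(f)}
with open("claude_output.json") as f:
    labeled = json.load(f)

# headlines Claude dropped, left unlabeled, or marked as outliers
missing = sent_ids - {h["id"] for h in labeled}
unlabeled = [h["id"] for h in labeled if h["topic_id"] is None]
outliers = sum(1 for h in labeled if h["topic_id"] == -1)

print(f"{len(missing)} missing, {len(unlabeled)} still null, {outliers} outliers")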

There you have it, a nice simple solution. Come September my data will probably evolve, and I’ll be able to refine my clustering almost automatically, just with an Anthropic subscription.
