by Sylvain Artois on Aug 27, 2025
I built myself a little hyperparameter tuning setup for BERTopic to evaluate the best combinations of UMAP/HDBSCAN settings and a few BERTopic parameters.
The script spins up a Python subprocess for each possible combination of a grid like this:
{
  "umap_n_neighbors": [12, 14, 16],
  "umap_n_components": [7, 8, 9, 10, 11],
  "umap_min_dist": [0.1],
  "hdbscan_min_cluster_size": [3],
  "hdbscan_min_samples": [8, 9, 10],
  "hdbscan_cluster_selection_epsilon": [0.3, 0.5],
  "bertopic_top_n_words": [12],
  "nr_topics": ["none"],
  "n_gram_max": [2, 3],
  "embedding_model": ["dangvantuan/sentence-camembert-large"],
  "remove_stopwords": [false]
}
To evaluate the clustering, I need a hand-labeled ground truth to compare each tested parameter combination against.
But hand-labeling is tedious. I have 9 headline categories (from SetFit), which potentially means 9 BERTopic configs, and therefore 9 ground-truth datasets… Not easy.
I wondered if Claude Sonnet could do the clustering for me.
I feed it raw data: a SQL query with proportional sampling by source to keep it somewhat representative, the kind of horror that would have been unthinkable before having a DBA bot permanently on tap:
WITH distinct_headlines AS (
  SELECT DISTINCT h.id, h.title, h.source_slug
  FROM headlines h
  JOIN headlines_batches hb ON h.id = hb.headline_id
  JOIN parsing_runs pr ON hb.batch_id = pr.batch_id
  JOIN sources s ON h.source_slug = s.slug
),
ranked_headlines AS (
  SELECT id, title, source_slug,
    ROW_NUMBER() OVER (PARTITION BY source_slug ORDER BY RANDOM()) AS rn,
    CEIL(COUNT(*) OVER (PARTITION BY source_slug) * 0.66) AS max_per_source
  FROM distinct_headlines
)
SELECT json_agg(
  json_build_object(
    'id', id,
    'title', title,
    'topic_id', null
  )
) AS headlines
FROM ranked_headlines
WHERE rn <= max_per_source
Here’s the prompt I use:
I need to cluster this JSON file containing 350 headlines from the French or Belgian press, categorized by Setfit as {{setfit_category}}.
My goal is to establish a ground truth to evaluate an unsupervised NLP clustering solution (BERTopic).
For the output, I need the same file back, with each "topic_id": null replaced by an integer id corresponding to the cluster you've identified.
Example:
{
"id": 319849,
"title": "{{ example }}",
"topic_id": 2
},
The goal is to obtain very coherent clusters.
Some guidelines:
- don't create overly broad mega-clusters
- don't create a cluster unless at least 3 sources mention the same topic
- try to identify entities: people, places, events, and use them to group headlines
- I prefer many outliers with precise clusters over few outliers with fuzzy clusters.
It's entirely possible that a precise, meticulous clustering yields 10 clusters with over 50% outliers.
For headlines you can't group, indicate "topic_id": -1 (outlier).
I compared my actual manual labeling with Claude's output, and the results are close, if not better: totally satisfactory.
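For the comparison itself, scikit-learn's clustering metrics are the natural tools; this is my sketch rather than the post's exact code, and it assumes both labelings are lists of {"id", "topic_id"} dicts with the -1 outliers kept as a label of their own:

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def compare_labelings(manual, predicted):
    """Score a predicted labeling against the manual ground truth.

    Both metrics are invariant to cluster renumbering, so it doesn't
    matter that Claude's cluster 2 might be my cluster 5.
    """
    pred_by_id = {h["id"]: h["topic_id"] for h in predicted}
    y_true = [h["topic_id"] for h in manual]
    y_pred = [pred_by_id[h["id"]] for h in manual]
    return {
        "ari": adjusted_rand_score(y_true, y_pred),
        "nmi": normalized_mutual_info_score(y_true, y_pred),
    }
```

The same function works for scoring each BERTopic parameter combination against the ground truth, which is the whole point of the setup.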
A few observations:
Size matters: beyond 350 items, Claude Sonnet (via the app) hits the token limit. At 350 items, I'd say it works roughly 80% of the time. But I always get a result anyway; you just need to finish off Claude's incomplete work. Since CSV is much less verbose than JSON, the limit might be higher with that format.
There you have it, a nice simple solution. Come September my data will probably evolve, and I’ll be able to refine my clustering almost automatically, just with an Anthropic subscription.