Why SetFit? Few-Shot Text Classification for Engineers Who Don't Have 10,000 Labeled Samples

The Dataset Problem

So you need to classify text into categories. Maybe it’s support tickets, news articles, or customer feedback. You want something that fits your exact data, not a generic Kaggle dataset. And you have the data, but it’s not labeled. And labeling 10,000 examples is either expensive or painful. Probably both.

Traditional text classification with BERT-style models typically requires 1,000 to 10,000+ labeled examples per class to get decent performance. If you’re classifying into 20 categories, that’s potentially 200,000 labeled samples. Who’s going to label all that? Not me.

I’m building a news aggregation system that classifies French headlines into categories like “Géopolitique”, “Sport”, “Technologie” (have a look at afk.live if you’re curious). I can’t label thousands of headlines manually, and I don’t want to use Amazon Mechanical Turk. I need a solution that works with limited data.

After a lot of searching—probably during a conversation with Claude—I discovered SetFit. And that was it.

What is SetFit?

SetFit stands for Sentence Transformer Fine-tuning, which tells you a lot about what it does: it fine-tunes Sentence Transformers for classification tasks.

The Origin Story

SetFit was introduced in September 2022 in the paper “Efficient Few-Shot Learning Without Prompts”. It’s a collaboration between:

Hugging Face: Lewis Tunstall (who also co-wrote the Natural Language Processing with Transformers book), Unso Eun Seo Jo, and Nils Reimers (the creator of Sentence Transformers, now Director of ML at Cohere)
Intel Labs: Moshe Wasserblat, Daniel Korat, and Oren Pereg from Intel’s NLP research group
UKP Lab (TU Darmstadt): Luke Bates

The Key Claim

On the Customer Reviews sentiment dataset, SetFit trained on only 8 labeled examples per class is competitive with RoBERTa-Large fine-tuned on the full training set of about 3,000 examples. That’s not a typo: 8 examples per class versus 3,000 in total.

On the RAFT benchmark (Real-world Annotated Few-shot Tasks), the paper reports SetFit scoring above GPT-3 overall while being about 1,600 times smaller, and beating the human baseline on 7 of the 11 tasks.

How SetFit Works: The Two-Phase Architecture

The architecture is simple. Training happens in two phases:

Figure: SetFit training phases

Phase 1: Teaching the Model What “Similar” Means

The first phase uses contrastive learning to fine-tune a pre-trained Sentence Transformer. The model learns to:

Pull together embeddings of sentences from the same class
Push apart embeddings of sentences from different classes

This is done using sentence pairs. Here’s why this is powerful: if you have 8 sentences per class, you can create many more training pairs. With two classes of 8 sentences each, every class yields 28 same-class (positive) pairs, so 56 in total, and there are 8 × 8 = 64 cross-class (negative) pairs. The number of unique training examples grows quadratically with the number of labeled sentences.
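The pair arithmetic can be checked in a few lines. This is a minimal sketch that assumes two classes with 8 labeled sentences each and enumerates every unique pair. The sentence IDs are placeholders, and SetFit's own sampler may draw a different number of pairs depending on its settings.

```python
from itertools import combinations, product

# Placeholder sentence IDs: two classes, 8 labeled sentences each (assumed setup).
class_a = [f"a{i}" for i in range(8)]
class_b = [f"b{i}" for i in range(8)]

# Positive pairs: two different sentences drawn from the same class.
pairs_within_a = list(combinations(class_a, 2))
pairs_within_b = list(combinations(class_b, 2))
positive_pairs = pairs_within_a + pairs_within_b

# Negative pairs: one sentence from each class.
negative_pairs = list(product(class_a, class_b))

print(len(pairs_within_a))   # 28 same-class pairs inside one class
print(len(positive_pairs))   # 56 same-class pairs across both classes
print(len(negative_pairs))   # 64 cross-class pairs
```

Sixteen labeled sentences turn into 120 distinct training pairs, which is where the few-shot leverage comes from.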

Phase 2: A Simple Classifier on Top

Once the Sentence Transformer has learned to create meaningful embeddings (where similar texts are close together), training a classifier becomes easy. By default, SetFit uses a scikit-learn Logistic Regression head. Nothing fancy, but it works.
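To make Phase 2 concrete, here is a sketch of a logistic regression head trained on embeddings. Everything in it is an assumption for illustration: the 2-D vectors are made up, the label assignment is hypothetical, and the hand-rolled gradient descent is a dependency-free stand-in for the scikit-learn classifier that SetFit actually uses.

```python
import math

# Toy 2-D "embeddings" standing in for Sentence Transformer outputs (made up).
embeddings = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8)]
labels = [1, 1, 0, 0]  # hypothetical: 1 = "Sport", 0 = "Géopolitique"


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(xs, ys, lr=0.5, steps=500):
    """Fit weights and bias with plain batch gradient descent on the log loss."""
    w = [0.0, 0.0]
    b = 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = [0.0, 0.0]
        grad_b = 0.0
        for (x1, x2), y in zip(xs, ys):
            err = sigmoid(w[0] * x1 + w[1] * x2 + b) - y
            grad_w[0] += err * x1
            grad_w[1] += err * x2
            grad_b += err
        w = [w[0] - lr * grad_w[0] / n, w[1] - lr * grad_w[1] / n]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return 1 if the predicted probability is at least 0.5, else 0."""
    return 1 if sigmoid(w[0] * x[0] + w[1] * x[1] + b) >= 0.5 else 0


w, b = train_logistic_regression(embeddings, labels)
predictions = [predict(w, b, x) for x in embeddings]
print(predictions)  # [1, 1, 0, 0]
```

Because Phase 1 has already pulled the classes apart, a linear boundary like this one is usually enough. SetFit wraps both phases, so you never write this part by hand.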

In code, this looks like:

from setfit import SetFitModel, Trainer, TrainingArguments

# Initialize with a pre-trained Sentence Transformer
model = SetFitModel.from_pretrained(
    "dangvantuan/sentence-camembert-large",  # French language model
    labels=["Géopolitique", "Sport", "Technologie"],  # ... plus the remaining categories
)

# Training arguments
args = TrainingArguments(
    batch_size=16,
    num_epochs=1,              # Usually just 1 epoch is enough
    num_iterations=5,          # Pair-generation rounds per training sentence
    sampling_strategy="oversampling",  # Per the SetFit docs, only used when num_iterations is unset
)

# train_dataset is a datasets.Dataset with "text" and "label" columns, built elsewhere
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()

That’s it. The framework handles the pair generation and two-phase training automatically.

Understanding Siamese Networks and Contrastive Learning

If “contrastive learning” and “Siamese networks” sound intimidating—I was confused too at first—let me break them down.

Siamese Networks: Twin Neural Networks

A Siamese network is simply two identical neural networks that share the same weights. Think of them as twins processing two inputs at the same time.

Figure: Siamese network architecture

Both sentences pass through the exact same model (shared weights). The outputs are two vectors (embeddings), and we measure how close or far apart they are. Sharing weights has three benefits:

Fewer parameters: One model instead of two means faster training
Consistency: Similar inputs produce similar outputs because they’re processed the same way
Comparability: The embeddings live in the same vector space, so distances are meaningful
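Weight sharing is easy to see in a toy example. In this sketch the "network" is a single made-up weight matrix, and both inputs go through the same `encode` function. The weights and input values are arbitrary assumptions, and a real Sentence Transformer is far larger.

```python
import math

# One set of weights shared by both "twins" (arbitrary values, for illustration).
SHARED_WEIGHTS = [
    [0.5, -0.2, 0.1],
    [0.3, 0.8, -0.5],
]


def encode(features):
    """Map a 3-number input to a 2-D embedding using the shared weights."""
    return [sum(w * x for w, x in zip(row, features)) for row in SHARED_WEIGHTS]


# Both inputs pass through the same function, so the embeddings are comparable.
left = encode([1.0, 0.0, 2.0])
near = encode([1.0, 0.1, 2.0])   # almost the same input
far = encode([-1.0, 2.0, 0.0])   # a very different input

print(math.dist(left, near))  # small distance
print(math.dist(left, far))   # much larger distance
```

There is only one `SHARED_WEIGHTS` to update during training, and a distance between two outputs is meaningful because both came from the same mapping.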

Contrastive Learning: Learning by Comparison

The core idea is simple: learn what makes things similar or different by comparing pairs.

Let me show you with real headlines from my French news dataset:

Positive pair (same category “Géopolitique”):

“Soudan: 60 morts dans une attaque de drone à el-Facher”
“Au Bénin, échec d’une tentative de coup d’Etat contre le président Patrice Talon”

The model should learn to place these close together in the embedding space.

Negative pair (different categories):

“Soudan: 60 morts dans une attaque de drone à el-Facher” (Géopolitique)
“Les Bleus remportent le match contre l’Italie” (Sport)

The model should push these far apart.
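Building such pairs from labeled text takes only a few lines. This sketch uses a hypothetical helper, `build_pairs`, that enumerates every pair among the three headlines above and tags it 1.0 for "same category" or 0.0 for "different category". Those target values are an assumption here, and SetFit's built-in sampler also balances positives against negatives, which this sketch does not.

```python
from itertools import combinations

# Labeled headlines taken from the examples above.
examples = [
    ("Soudan: 60 morts dans une attaque de drone à el-Facher", "Géopolitique"),
    ("Au Bénin, échec d’une tentative de coup d’Etat contre le président Patrice Talon", "Géopolitique"),
    ("Les Bleus remportent le match contre l’Italie", "Sport"),
]


def build_pairs(labeled_texts):
    """Return (text_a, text_b, target) triples: 1.0 for same class, else 0.0."""
    pairs = []
    for (text_a, label_a), (text_b, label_b) in combinations(labeled_texts, 2):
        target = 1.0 if label_a == label_b else 0.0
        pairs.append((text_a, text_b, target))
    return pairs


pairs = build_pairs(examples)
for text_a, text_b, target in pairs:
    print(target, "|", text_a[:30], "<->", text_b[:30])
```

Three headlines give one positive pair and two negative pairs.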

The Cosine Similarity Loss

SetFit uses cosine similarity to measure how close two embeddings are. Values range from -1 (opposite directions) to 1 (same direction).

Cosine Similarity
       │
   1.0 ├─── Perfect match (same direction)
       │
   0.0 ├─── Orthogonal (unrelated)
       │
  -1.0 ├─── Opposite (completely different)
       │

During training:

Positive pairs: The loss penalizes any gap between the cosine similarity and a target of 1.0
Negative pairs: The loss penalizes any gap between the cosine similarity and a target of 0.0
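The loss itself is short enough to write out. This sketch assumes the default behaves like the Sentence Transformers cosine similarity loss: the mean squared error between each pair's cosine similarity and a target of 1.0 (positive) or 0.0 (negative). The embeddings are made-up 2-D vectors, and the function names are mine.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_similarity_loss(pairs):
    """Mean squared error between each pair's cosine similarity and its target."""
    errors = [(cosine_similarity(u, v) - target) ** 2 for u, v, target in pairs]
    return sum(errors) / len(errors)


# Embeddings that already match their targets.
well_separated = [
    ([1.0, 0.0], [2.0, 0.0], 1.0),  # positive pair, same direction
    ([1.0, 0.0], [0.0, 3.0], 0.0),  # negative pair, at right angles
]
# Embeddings that contradict their targets.
poorly_separated = [
    ([1.0, 0.0], [0.0, 1.0], 1.0),  # positive pair, at right angles
    ([1.0, 0.0], [1.0, 0.0], 0.0),  # negative pair, same direction
]

print(cosine_similarity_loss(well_separated))    # 0.0
print(cosine_similarity_loss(poorly_separated))  # 1.0
```

Training nudges the encoder's weights so that real pairs move from the second situation toward the first.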

Why SetFit Over Traditional Approaches?

Dataset Size: The Main Advantage

Approach	Examples per Class	Notes
Traditional BERT fine-tuning	1,000 - 10,000+	Requires large labeled datasets
PET (Pattern-Exploiting Training)	10 - 100	Requires handcrafted prompts
GPT-3/4 prompting	0 - 10	Expensive API calls, prompt engineering
SetFit	8 - 50	No prompts, fast training, runs locally

Training Speed and Cost

Here’s the inference speed I see in production on my Dell G16 with an RTX 4070 (a mid-range gaming laptop, self-hosted):

Operation	Headlines	Time	Speed
Classification	577	~0.5s	~1,150/sec
Model loading	-	~2.5s	-
Full batch (load + classify + DB update)	577	~8s	-

On the training side, the figures reported by the SetFit authors for an experiment with 8 labeled examples per class:

Model	Hardware	Training Time	Cost (approx.)
SetFit	NVIDIA V100	~30 seconds	~$0.025
T-Few 3B	NVIDIA A100	~11 minutes	~$0.70

That’s 28x cheaper ($0.70 versus $0.025) and about 22x faster (11 minutes versus 30 seconds).

No Prompts Required

Unlike PET or GPT-3 prompting, SetFit doesn’t require you to craft clever prompts. You just provide labeled examples, and it figures out the rest.

This matters because:

Prompt engineering is hard and requires experimentation
Prompts are fragile - small changes can change results a lot
Different languages need different prompts - SetFit just needs a multilingual Sentence Transformer

When Should You Use SetFit?

In theory, SetFit is the right tool when you have:

Limited labeled data: You have 10-100 examples per class, not thousands
Multilingual projects: Just swap the base Sentence Transformer model
Fast prototyping needs: Train a classifier in under a minute
Production constraints: Small model size, fast inference, runs locally
Privacy-sensitive data: No need to send data to external APIs

The SetFit Ecosystem

SetFit has grown into a mature library with good documentation and community support:

Official Resources

Documentation: huggingface.co/docs/setfit
GitHub: github.com/huggingface/setfit
Original Paper: arxiv.org/abs/2209.11055
Hugging Face Blog: huggingface.co/blog/setfit
Few-Shot Learning in Production: A workshop by Hugging Face and Intel covering SetFit, model compression, and deployment (1h22m)

Community

The SetFit organization on Hugging Face hosts hundreds of pre-trained models for various tasks and languages. With ~250,000 downloads per month, you’re not alone.

What’s Next in This Series

This article covered the “why” and “what” of SetFit. In the next articles, I’ll go into the practical “how”.

But first, why SetFit in the AFK data pipeline? I have two main goals:

Pre-process headlines before BERTopic clustering: categorize them first so I can adjust UMAP and HDBSCAN settings based on each category’s semantic field
Implement an editorial policy: filter headlines by category to control what appears on the site

You’ll see that SetFit alone is not a magic tool that can handle everything. But it’s a useful filter in building the pipeline.

Article 2: Building a SetFit Dataset: The Taxonomy

How to structure your taxonomy
Labeling strategies
Dataset preparation for SetFit

This article is part of my journey learning ML as a senior engineer without a data science background. I’m documenting what I learn, including the mistakes, because the polished tutorials rarely show the messy reality of building ML systems.

References

Tunstall, L., Reimers, N., Jo, U.E.S., Bates, L., Korat, D., Wasserblat, M., & Pereg, O. (2022). Efficient Few-Shot Learning Without Prompts. arXiv:2209.11055.
Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP 2019.
Schick, T., & Schütze, H. (2020). Exploiting Cloze Questions for Few-Shot Text Classification and Natural Language Inference. arXiv:2001.07676.
Alex, N., et al. (2021). RAFT: A Real-World Few-Shot Text Classification Benchmark. NeurIPS 2021.