NLI: The Architecture Hiding Inside Your Embeddings (and a Zero-Shot Classifier You Already Have)

by Sylvain Artois on Jun 9, 2026

  • #nli
  • #natural-language-inference
  • #zero-shot-classification
  • #sentence-embeddings
  • #text-classification
  • #machine-learning

Murnau, 1910 - Alexej von Jawlensky

Where We Left Off

In the SetFit article, I solved one specific problem: how to classify French news headlines into categories without labeling 10,000 examples per class. SetFit’s idea was few-shot fine-tuning. You take a Sentence Transformer, train it with contrastive learning on a few examples, and add a simple logistic regression classifier on top.

But there is another path I did not cover. What if you need zero labeled examples? Not eight. Zero.

That path is Natural Language Inference (NLI). And here is the part that took me a long time to understand: NLI is not just another option next to SetFit. It is the architecture that trained the embeddings SetFit depends on. The same idea appears twice. First it is hidden, deep inside how your Sentence Transformer learned what “similar” means. Then it is visible, as a classifier you can use without any training.

I have now used NLI in three different parts of afk.live. It was a success in one, a small non-event in another, and a clear failure in the third. The failure is the most useful story, so we will get there.

What Is NLI?

Natural Language Inference, called Recognizing Textual Entailment (RTE) in older papers, asks one simple question. You have two sentences: a premise and a hypothesis. What is the relationship between them?

  • Entailment: the hypothesis follows from the premise.
  • Contradiction: the hypothesis cannot be true if the premise is true.
  • Neutral: neither one; the hypothesis may be true or false.

Here is a real pair from my French news pipeline:

Premise: “Les députés adoptent une nouvelle loi sur la sécurité intérieure.” (Lawmakers pass a new domestic security law.)
Hypothesis: “Une décision politique a été prise.” (A political decision has been made.)
→ Entailment

The relationship has a direction. If the premise entails the hypothesis, that tells us nothing about the other way around. And the logic is “soft”. It is closer to what a reasonable person would understand than to strict formal logic. That softness is exactly why it works well on messy real-world text.

A Short History (It Matters Later)

NLI did not start with transformers. The history is worth knowing, because the datasets are what give the architecture its power.

  • 1994–1996: The FraCaS project builds a test suite of 346 inference problems. This is early computational semantics, before deep learning.[1]
  • 2005–2011: The PASCAL RTE Challenges (RTE-1 to RTE-7) define the task. Each set has about 800 to 1,000 sentence pairs, annotated by hand, taken from news and encyclopedia text. Small and carefully built.[2]
  • 2015: Everything changes with the Stanford NLI corpus (SNLI) (Bowman et al.).[3] It has 570,000 human-written pairs, balanced across the three labels. Now there is enough data to train deep networks.
  • 2018: MultiNLI (Williams et al.) adds many text genres (speech, fiction, government reports) and a test set for cross-genre transfer.[4]
  • 2018: XNLI (Conneau et al.) extends MultiNLI to 15 languages, including French.[5] This is how NLI enters my French-only pipeline.

Remember SNLI and MultiNLI. They will appear again in an unexpected place.

The Hidden Role: NLI Trained Your Embeddings

Here is the connection I did not see for months.

In the SetFit article, I used Sentence-BERT (SBERT) by Reimers & Gurevych (2019).[6] It is the base of the whole sentence-transformers ecosystem. So how was the original SBERT trained to produce good sentence embeddings?

On SNLI and MultiNLI. The same NLI datasets from above.

The method: send sentence pairs through a Siamese BERT (shared weights, the same twin-network idea I drew in the SetFit article), then train a softmax classifier over the three NLI labels on top of the pooled embeddings. The side effect of learning to predict entailment, contradiction, and neutral is an embedding space where related sentences are close to each other. That embedding space is what every later cosine-similarity search, every BERTopic clustering, and yes, every SetFit run, is built on.
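To make that objective concrete, here is a minimal sketch in plain Python with toy vectors. It is not the original training code. It assumes the formulation described in the Sentence-BERT paper: the two pooled embeddings u and v are concatenated with their element-wise difference |u − v|, and a single softmax layer predicts the three labels. The function names, the 2-dimensional "embeddings", and the hand-picked weights are all hypothetical.

```python
import math

LABELS = ["entailment", "contradiction", "neutral"]

def pair_features(u, v):
    """Concatenate u, v and |u - v| into one feature vector."""
    return list(u) + list(v) + [abs(a - b) for a, b in zip(u, v)]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nli_probabilities(u, v, weights):
    """Apply one linear layer (one weight row per label), then a softmax."""
    feats = pair_features(u, v)
    logits = [sum(w * f for w, f in zip(row, feats)) for row in weights]
    return dict(zip(LABELS, softmax(logits)))

# Toy 2-dimensional "embeddings" and hand-picked weights, for illustration only.
u = [0.9, 0.1]
v = [0.8, 0.2]
weights = [
    [0.5, 0.5, 0.5, 0.5, -2.0, -2.0],  # entailment: penalise a large |u - v|
    [0.0, 0.0, 0.0, 0.0, 2.0, 2.0],    # contradiction: reward a large |u - v|
    [0.2, 0.2, 0.2, 0.2, 0.0, 0.0],    # neutral
]
probs = nli_probabilities(u, v, weights)
print(probs)
```

In real training the weights are learned and the gradient flows back into the shared encoder, which is what shapes the embedding space.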

So when SetFit “fine-tunes a Sentence Transformer”, it is improving a representation that NLI created in the first place.

One honest note. The specific French model I used in the SetFit article, dangvantuan/sentence-camembert-large, was in fact fine-tuned on STSb (semantic textual similarity), not directly on NLI. It scores 85.9 Pearson on the French STS benchmark. But the standard SBERT method is NLI-based, and the French NLI data does exist (XNLI through FLUE). The link is real. I just do not want to overstate it for this one model.

NLI as a Zero-Shot Classifier

The second life of NLI is the one I actually deploy. It comes from Yin, Hay & Roth (2019).[7] They made a very simple observation: any classification task can be rewritten as entailment.

Do you want to know if a headline is about sport? Do not train a classifier. Just ask an NLI model:

Premise: “Les Bleus remportent le match contre l’Italie.” (Les Bleus win the match against Italy.)
Hypothesis: “Ce texte parle de sport.” (This text is about sport.)
→ P(entailment) = 0.94

Run one hypothesis per possible label, take the label with the highest entailment probability, and you have a classifier. No training data. No fine-tuning. No GPT-style prompt engineering. You write the hypotheses in plain language, and that is all. Recent research takes this idea further, building general-purpose “universal classifiers” by training a single model on many NLI-style tasks at once.[8]
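The "one hypothesis per label, keep the best" loop can be sketched as follows. This is a minimal illustration, not the AFK code: the NLI model is abstracted behind a hypothetical `entail_prob(premise, hypothesis)` callable, and a stub with made-up probabilities stands in for it so the example runs without downloading anything. The "politique" and "économie" hypotheses are invented for the example.

```python
from typing import Callable, Dict, Tuple

# Hypothetical interface: any function returning P(entailment) for a pair.
EntailmentScorer = Callable[[str, str], float]

def zero_shot_classify(
    premise: str,
    hypotheses: Dict[str, str],
    entail_prob: EntailmentScorer,
) -> Tuple[str, Dict[str, float]]:
    """Score one hypothesis per label and return the best label with all scores."""
    scores = {label: entail_prob(premise, hyp) for label, hyp in hypotheses.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Illustrative label set; only the "sport" hypothesis comes from the article.
HYPOTHESES = {
    "sport": "Ce texte parle de sport.",
    "politique": "Ce texte parle de politique.",
    "économie": "Ce texte parle d'économie.",
}

def fake_entail_prob(premise: str, hypothesis: str) -> float:
    """Stand-in for a real NLI model: returns canned, made-up probabilities."""
    canned = {
        "Ce texte parle de sport.": 0.94,
        "Ce texte parle de politique.": 0.03,
        "Ce texte parle d'économie.": 0.02,
    }
    return canned[hypothesis]

label, scores = zero_shot_classify(
    "Les Bleus remportent le match contre l'Italie.",
    HYPOTHESES,
    fake_entail_prob,
)
print(label, scores)
```

In practice the scorer would wrap an actual NLI model. The Hugging Face transformers library ships a `zero-shot-classification` pipeline that follows this same pattern, with a `hypothesis_template` argument for the wording of the hypothesis.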

The model I use across AFK is cmarkea/distilcamembert-base-nli:

Property      | Value
------------- | ------------------------------------
Parameters    | 68.1M (distilled CamemBERT)
Training data | XNLI from FLUE (392,702 FR pairs)
Accuracy      | 77.45% on the FR test set
Inference     | ~51 ms per pair (CPU, Ryzen 5 4500U)
Cached size   | ~270 MB
License       | MIT

Sixty-eight million parameters. It runs on CPU. No API bill. That profile is the main reason it ended up in three different services, and the reason it is worth comparing with SetFit at the end.

Three Real Uses in AFK

Theory is clean. Production is where you learn if your tool fits your task. I have three cases, and they do not agree with each other. That is the useful part.

1. The Failure: Telling Facts from Opinions

AFK has a fact-extractor service. It pulls structured facts (who, where, what) out of curated headlines using an LLM. To save compute, I wanted to filter the headlines first: drop the “soft” ones (editorials, tributes, reaction quotes) before paying for extraction, and keep only the “hard facts” (votes, deaths, decisions, measured results).

Zero-shot NLI looked perfect. I wrote hypotheses for each editorial scope:

  • Positive (a hard fact): “Cet article rapporte un événement concret survenu dans le monde réel.” (This article reports a concrete event that occurred in the real world.)
  • Negative (soft): “Cet article rapporte la déclaration, l’opinion ou la réaction d’une personne.” (This article reports a person’s statement, opinion, or reaction.)

The score combined the strongest positive and the strongest negative signal:

is_hard = max(P_entail_positive) − 0.5 · max(P_entail_negative)

Headlines below a threshold τ get dropped. FR-native sources went through distilcamembert-base-nli. Translated titles went through MoritzLaurer/mDeBERTa-v3-base-xnli. Clean design. I calibrated it on 200 headlines labeled by hand.

F1 = 0.42. My target was 0.70.
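For reference, the scoring rule and the F1 check can be sketched like this. It is a simplified illustration under stated assumptions, not the production code: the threshold value, the toy probabilities, and the choice of "hard fact" as the positive class for F1 are all hypothetical.

```python
def hard_fact_score(pos_probs, neg_probs, neg_weight=0.5):
    """is_hard = max(P_entail_positive) - 0.5 * max(P_entail_negative)."""
    return max(pos_probs) - neg_weight * max(neg_probs)

def f1_score(y_true, y_pred):
    """F1 for the positive class (assumed here to be "hard fact")."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def evaluate(scored, tau):
    """scored: list of (score, is_hard_label); headlines below tau are dropped."""
    y_true = [label for _, label in scored]
    y_pred = [score >= tau for score, _ in scored]
    return f1_score(y_true, y_pred)

# Made-up entailment probabilities paired with hand labels, for illustration only.
examples = [
    (hard_fact_score([0.80, 0.40], [0.20]), True),   # clear hard fact, kept
    (hard_fact_score([0.70], [0.69]), True),         # hard fact with a speech verb, dropped
    (hard_fact_score([0.30], [0.90]), False),        # opinion, dropped
    (hard_fact_score([0.60], [0.10]), False),        # soft headline, wrongly kept
]
TAU = 0.4  # hypothetical threshold
f1 = evaluate(examples, TAU)
print(f1)
```

The second toy row shows how a strong negative signal can pull a genuine hard fact below the threshold.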

I spent a day rewriting the hypotheses. Concrete lists for the positives (“vote, official decision, conviction, agreement”), and “absence of a fact” for the negatives. I calibrated again.

F1 = 0.567. Better. Still far from 0.70.

Here is why it failed, and this is the main lesson of the whole article. The negative hypothesis “rapporte une déclaration ou une opinion” fires strongly, with P(entailment) around 0.69, on any headline with an attribution verb: “Macron a annoncé”, “l’ONU a déclaré”. That is about 80% of the news. The NLI model was reading the form of the sentence (is there a speech verb?), not the substance (was a real event reported?). A headline that reports a real decision and a headline that reports someone’s opinion both contain “a déclaré”. NLI cannot separate them, because “does the text entail this claim?” is not the same question as “is this hard news?”.

I had matched the wrong tool to the task. So I removed the NLI filter and moved the editorial decision into the LLM extraction prompt itself. Mistral already reads every headline during extraction, so I let it flag the soft ones at almost no extra cost. The NLI code stays in the project, turned off, as a backup. But it is off in production.

The lesson: NLI answers “does the text entail this claim?”. This is a question about meaning relations, and it leans on surface words and structure. It does not answer “is this text the kind of thing I care about?” when the difference is about substance hidden under the same surface form.

2. The Restraint: Knowing When Not to Use It

The idea-curator service sorts think-tank publications into four types: Étude, Note, Rapport, Tribune. I could have used NLI again here. The model was already loaded elsewhere, and it is free.

I did not. Or rather, I used it only as a tie-breaker.

The key point is that three of these four types are decided by format, not by meaning. A Rapport is long. A Note is short. An Étude is in between. RSS category tags plus a page-length rule get you most of the way:

text_length ≥ 1200 → Rapport
text_length ≥ 500  → Étude
otherwise          → Note

Only Tribune (an opinion piece) has a real stance signal that length cannot catch. So that is the only place NLI runs: when the rules are not clear, distilcamembert-base-nli does a zero-shot pass to decide if a short piece is an opinion or just a short note. It runs on about 5% of the rows. No LLM spend on typing at all.
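A minimal sketch of this "rules first, NLI only as a tie-breaker" flow is shown below. Several details are assumptions made for the example: that an RSS tag, when present, wins over the length rule; that the NLI check is only tried on pieces the length rule would call a Note; and the 0.5 opinion threshold. The `opinion_prob` callable is a hypothetical stand-in for the zero-shot NLI pass.

```python
from typing import Callable, Optional

def type_by_length(text_length: int) -> str:
    """Length rule from the article: long is Rapport, medium is Étude, short is Note."""
    if text_length >= 1200:
        return "Rapport"
    if text_length >= 500:
        return "Étude"
    return "Note"

def classify_publication(
    text_length: int,
    rss_type: Optional[str] = None,
    opinion_prob: Optional[Callable[[], float]] = None,
    opinion_threshold: float = 0.5,  # hypothetical value
) -> str:
    """Use cheap rules first; call the NLI scorer only for short, untagged pieces."""
    if rss_type is not None:
        return rss_type  # assumed: an explicit RSS tag is trusted as-is
    guess = type_by_length(text_length)
    if guess == "Note" and opinion_prob is not None:
        # The callable is lazy, so the model only runs on this ambiguous branch.
        if opinion_prob() >= opinion_threshold:
            return "Tribune"
    return guess

print(classify_publication(2000))                            # length rule only
print(classify_publication(300, opinion_prob=lambda: 0.8))   # NLI tie-breaker fires
print(classify_publication(300, opinion_prob=lambda: 0.1))   # stays a short note
```

Passing the scorer as a zero-argument callable keeps the model call lazy, which matches the goal of running NLI on only a small share of rows.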

This is the boring story, and that is the point. The real engineering decision was choosing not to use the advanced tool where simple rules already win, and keeping NLI as cheap insurance for the truly unclear cases. The fact-extractor failure taught me that.

3. The Win: Detecting Rhetorical Stance

The third use is where NLI works very well: the Latouromètre, a metric in the words-weight column pipeline. It places political speech on the four “attractors” from Bruno Latour’s book Où atterrir ?[9] (Terrestrial, Global, Out-of-this-world, Local).

The first version placed text on each pole using cosine similarity against seed phrases. This is pure embedding geometry. It worked for sincere texts. It failed badly on rhetorical inversion: when someone uses ecological vocabulary to attack ecologists. Cosine similarity sees the ecological words and scores the text as Terrestrial, which is the opposite of the truth.
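The cosine-only approach, and the reason it cannot see inversion, can be illustrated with toy vectors. This is a sketch under assumptions: averaging the similarities over a pole's seed phrases is my choice for the example (the aggregation is not specified here), and the 2-dimensional vectors are invented stand-ins for real sentence embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pole_scores(text_vec, seed_vecs_by_pole):
    """Mean cosine similarity between the text and each pole's seed vectors."""
    return {
        pole: sum(cosine(text_vec, seed) for seed in seeds) / len(seeds)
        for pole, seeds in seed_vecs_by_pole.items()
    }

# Toy seed embeddings: the first axis stands for "ecological vocabulary".
SEEDS = {
    "Terrestrial": [[1.0, 0.0], [0.9, 0.1]],
    "Global": [[0.0, 1.0], [0.1, 0.9]],
}

sincere_text = [0.95, 0.05]   # uses ecological words to defend ecology
inverted_text = [0.90, 0.20]  # uses the same words to attack ecologists

sincere = pole_scores(sincere_text, SEEDS)
inverted = pole_scores(inverted_text, SEEDS)
print(max(sincere, key=sincere.get), max(inverted, key=inverted.get))
```

Both texts land on the same pole, because the geometry only reflects which words are present, not which side the author takes.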

This is a stance problem: does the text support or attack the pole? And “does this text entail ‘I support X’ or ‘I oppose X’?” is exactly what NLI is built for. Finally the task matched the tool.

I wrote 6 hypotheses per pole: “pro” hypotheses and “contra” hypotheses, made to catch inversions.

  • Pro: “Ce texte appelle à habiter la Terre et à composer avec le vivant.” (This text calls for inhabiting the Earth and living alongside other forms of life.)
  • Contra: “Ce texte mobilise un vocabulaire écologique pour s’opposer à des projets de transition énergétique.” (This text uses ecological vocabulary to oppose energy-transition projects.)

The stance signal for each pole is the difference between pro and contra entailment. Then I combine it with the cosine signal by addition:

stance[P] = mean(P_entail_pro) − mean(P_entail_contra)
raw[P]    = cosine[P] + γ · stance[P]        # γ = 1.0

The addition is important. On a neutral text, stance ≈ 0, so the cosine geometry passes through with no change. On an inversion, the stance term flips the result clearly.
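The two formulas translate almost directly into code. The sketch below uses made-up cosine scores and entailment probabilities (three pro and three contra values per pole is an assumption for the example) to show both behaviours: a near-zero stance on a neutral text, and a sign flip on an inversion.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

def stance(pro_probs, contra_probs):
    """stance[P] = mean(P_entail_pro) - mean(P_entail_contra)."""
    return mean(pro_probs) - mean(contra_probs)

def raw_score(cosine_score, pro_probs, contra_probs, gamma=1.0):
    """raw[P] = cosine[P] + gamma * stance[P]."""
    return cosine_score + gamma * stance(pro_probs, contra_probs)

# Neutral text: pro and contra entailment cancel out, cosine passes through.
neutral = raw_score(0.30, [0.10, 0.12, 0.08], [0.11, 0.09, 0.10])

# Inversion: ecological vocabulary gives a high cosine, but the contra
# hypotheses fire strongly, so the combined score turns negative.
inverted = raw_score(0.45, [0.10, 0.05, 0.15], [0.85, 0.90, 0.80])

print(neutral, inverted)
```

Because the combination is additive rather than multiplicative, a stance of zero leaves the cosine score untouched instead of wiping it out.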

The numbers, on a 21-text labeled set:

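Text                   | Pattern                         | Margin before | Margin after stance
---------------------- | ------------------------------- | ------------- | ------------------------
Bruckner (anti-Greta)  | ecological words used to attack | 0.05 (wrong)  | 0.42 (correctly flipped)
Butré (anti-wind-farm) | same inversion                  | 0.13 (wrong)  | 0.60 (correctly flipped)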

Total accuracy went to 19/21. The two remaining errors are known, documented edge cases. The whole stance layer adds about 10 to 15 seconds per chronicle on CPU. No GPU, no API.

It is the same 68M-parameter model as the two other services. The difference between failure and success was not the model. It was whether the task was a real entailment question or only looked like one.

NLI vs. The Alternatives

So when should you use NLI? Let us go back to the question that started the SetFit series: how do you classify text without a large labeled dataset? There are four real answers, and NLI is one of them.

Approach | Labels needed | Training? | Cost / footprint | Best when
-------- | ------------- | --------- | ---------------- | ---------
LLM prompting (GPT, Mistral, Claude…) | 0 | None | API calls, $$, or a large local model | You need careful judgment, substance over form, and you can pay the latency/cost
SetFit (few-shot fine-tuning) | ~8–50 / class | Yes (minutes) | Small model, fast local inference | You have a few labels and want a fast, stable classifier you can reuse
NLI zero-shot (entailment) | 0 | None | ~270 MB, CPU, no API | The decision is a real entailment / stance question, and the labels are not known in advance
Embeddings + cosine (seed phrases) | 0 (just seeds) | None | Embedding model, CPU | Thematic closeness (“is this about X?”), sincere texts, no rhetorical inversion

The trade-offs that decided things for me:

  • Ease of use: NLI is the easiest to start with. You write a hypothesis in French, and you get a probability. No dataset, no training loop, no prompt tuning on a remote model. SetFit needs a labeling pass. LLM prompting needs prompt engineering and a billing account.
  • Memory and cost: this is NLI’s best feature. 68M parameters, about 270 MB, runs on CPU at around 50 ms per inference, MIT license, no network. For a self-hosted prototype, that is a big advantage. SetFit is similarly light once trained. LLM prompting is at the other end of the scale.
  • The catch: NLI’s low cost is wasted if your task is not really an entailment question. SetFit, with a few labels, can learn whatever boundary your data implies, including the substance-vs-form difference that broke my fact filter. NLI cannot learn. It can only ask the question you write, and it answers it through surface meaning.

That is the link back to the SetFit article. SetFit and NLI are two sides of the same coin made by Sentence-BERT. SetFit fine-tunes the NLI-born embedding space with a few labels. NLI reuses the entailment head directly with none. SetFit learns your boundary. NLI borrows a general one. Choose SetFit when your distinction lives in your data and you can label a little. Choose NLI when your distinction is really a question of entailment or stance, and enjoy not training anything at all.


This article is part of my journey learning ML as a senior engineer without a data science background. I document what I learn, including the failures. The F1 = 0.567 that did not pass is here on purpose, because the clean tutorials never show you the experiments that failed.

Notes


  1. The FraCaS test suite (Cooper et al., 1996): 346 inference problems across nine categories of semantic phenomena, built by the EU FraCaS project on computational semantics. See R. Cooper et al., Testing the FraCaS test suite. A French version of the suite (LREC 2020) also exists.

  2. Dagan, I., Glickman, O., & Magnini, B. (2005). The PASCAL Recognising Textual Entailment Challenge. Machine Learning Challenges, LNCS 3944, Springer.

  3. Bowman, S.R., Angeli, G., Potts, C., & Manning, C.D. (2015). A large annotated corpus for learning natural language inference (SNLI). EMNLP 2015.

  4. Williams, A., Nangia, N., & Bowman, S.R. (2018). A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference (MultiNLI). NAACL 2018.

  5. Conneau, A., et al. (2018). XNLI: Evaluating Cross-lingual Sentence Representations. EMNLP 2018.

  6. Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP 2019.

  7. Yin, W., Hay, J., & Roth, D. (2019). Benchmarking Zero-shot Text Classification: Datasets, Evaluation and Entailment Approach. EMNLP-IJCNLP 2019.

  8. Laurer, M., van Atteveldt, W., Casas, A., & Welbers, K. (2023). Building Efficient Universal Classifiers with Natural Language Inference.

  9. Latour, B. (2017). Où atterrir ? Comment s’orienter en politique. La Découverte.