Trusting a Multi-Agent Pipeline: 6-grams, Phoenix, and a Reconciliation Step

by Sylvain Artois on Apr 19, 2026

  • #smolagents
  • #llm
  • #multi-agent
  • #phoenix
  • #observability

Blue Morning, 1909 - George Bellows - www.nga.gov

AFK is a news aggregation platform. One of its features lets me turn a curated, opinionated prompt into a multimedia dossier — a small knowledge graph of headlines, videos, podcasts, books, web excerpts, all built by an agent. This post is about a problem I didn’t see coming in that pipeline, and the deep-dive into smolagents it took to fix it. It’s deliberately open — I’m sharing what works for me today, not a finished playbook.

The pipeline, in one paragraph

The dossier pipeline is built on smolagents, HuggingFace’s lightweight agent framework. A CodeAgent (Claude Sonnet, the coordinator) reads a French-language prompt, decides what to look for, and delegates to five ToolCallingAgent specialists (Claude Haiku — headlines, video, audio, knowledge, music). Each specialist is wired to a small, deterministic set of tools: web_excerpt, wikipedia_excerpt, afk_headlines, youtube_search, radiofrance, halldulivre, brave_web_search, and so on. Everything runs through LiteLLMModel so the same code talks to the Anthropic API.

The whole surface area fits in a few imports:

from smolagents import CodeAgent, ToolCallingAgent, LiteLLMModel
from smolagents.memory import ActionStep

CodeAgent is the smolagents flavour that thinks in Python — its “actions” are short scripts the framework executes in a sandbox, which is what makes it suitable for orchestration (it can call its managed agents like functions). ToolCallingAgent is the more conventional one — it emits structured tool calls, runs them, and feeds the observations back into its context. LiteLLMModel is the model adapter; swap the model_id and you can point it at OpenAI, Bedrock, anything LiteLLM supports.

Two things matter for the rest of this post:

  • Tools are deterministic. web_excerpt returns a verbatim substring of a live page, plus a SHA-256 hash of the URL as external_id. afk_headlines returns rows from a database with integer IDs.
  • The coordinator is the only thing the user sees. Specialist outputs are intermediate. The final graph is whatever the coordinator decides to assemble.

That second point turned out to matter more than I thought.

The Wednesday I lost trust

April 15, I shipped a new prompt: a charged dossier on thirty years of French legislative repression against free parties, with ten explicit URLs the coordinator must extract and an unambiguous editorial stance (“AFK takes the side of free assembly”). The coordinator ran for ~13 steps, the specialists chimed in, the JSON came out clean.

Then I started reading the JSON.

In the metadata.excerpt of the dijoncter.info node, I noticed a phrase I half-recognized: “comme forme de résistance culturelle”. It sounded like the prompt — not like the source. I opened the live page. The source said: “favorise la politisation du mouvement (création de Technopol et de Techno+) et l’essor des Free parties”. No “résistance culturelle”. The coordinator had added editorial commentary the journalist had never written.

That’s not a paraphrase. That’s a category violation.

Why this matters for AFK

AFK’s value proposition is one sentence: honest indexing of journalism. Readers click through to the source. If the quote on AFK doesn’t match what the source actually says, the whole proposition collapses — not gracefully, not partially. The point of the platform is that you can trust what you read on it.

A coordinator that silently rewrites verbatim quotes is, for AFK, an existential bug.

Diagnosis: 6-gram overlap

I wrote a small replay script that diffs every final-graph excerpt against the actual WebExcerptTool observation in the agent trace, scoring overlap as the fraction of shared 6-word sequences (lowercased). The number is brutal because at 6 grams you’re scoring near-verbatim substrings; anything below 50% means the model rewrote the sentence in its own words.
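The metric itself is tiny. A minimal sketch, assuming whitespace tokenization and scoring against the final excerpt's 6-grams (the replay script's version may differ in plumbing):

```python
def ngram_overlap(tool_text: str, final_text: str, n: int = 6) -> float:
    """Fraction of the final excerpt's n-grams found verbatim in the tool output."""
    def ngrams(text: str) -> set[tuple[str, ...]]:
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    final_grams = ngrams(final_text)
    if not final_grams:
        return 0.0  # excerpt shorter than n words: nothing to score
    return len(final_grams & ngrams(tool_text)) / len(final_grams)
```

A genuine verbatim quote scores 1.0; a full paraphrase scores 0.0, because rewording any single word kills the six n-grams that pass through it.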

| URL | Tool (chars) | Final (chars) | 6-gram overlap | Verdict |
|---|---|---|---|---|
| lemonde.fr | 879 | 331 | 0.0 % | fabricated |
| tsugi.fr | 785 | 290 | 2.1 % | fabricated |
| cheperz.org | 385 | 167 | 0.0 % | fabricated |
| aoc.media | 1 140 | 194 | 17.4 % | fabricated |
| technoplus/mariani | 771 | 305 | 10.3 % | fabricated |
| dijoncter.info | 355 | 233 | 26.5 % | fabricated |
| technoplus/loi | 933 | 221 | 37.9 % | fabricated |
| technoplus/chronologie | 330 | 185 | 68.0 % | rewritten |
| lemediatv.fr | 459 | 143 | 68.4 % | rewritten |

Nine out of nine. The tool returned the right text every time — the coordinator (temperature=0.7) reformulated every one of them while assembling the final JSON.

The hashes were even worse. WebExcerptTool builds external_id as hashlib.sha256(url.encode()).hexdigest()[:16]. The coordinator can’t compute SHA-256 in its head — so it just types plausible-looking hex and moves on. Every web external_id in the graph was invented:

| URL | Real hash (tool) | Graph hash | Match |
|---|---|---|---|
| lemonde.fr | 34030d787b833915 | 8f3c2a1b4d5e6f7a | NO |
| tsugi.fr | ab9461ae48abf932 | 7b2d9e4c1a8f3b5e | NO |
| technoplus/chronologie | 497a5350da44fe67 | 5c4e8a2f1b9d3e7c | NO |
| technoplus/loi | c4fa476f28013cf0 | 3d7f2b5e9a1c4e8d | NO |
| technoplus/mariani | 69080d3674433bbc | 2e9a4c7f1d5b8e3a | NO |
| dijoncter.info | 5ba7ed7e1cb0b71b | 1f8c5e3a9d2b7e4c | NO |
| cheperz.org | c411314986ef174f | 6e3c9a2f7d1b5e8a | NO |
| aoc.media | 41628209b8b3f09e | 9d2e7a4c1f5b8e3a | NO |
| lemediatv.fr | 512614784b69ee60 | 8c5e2a9f1d7b3e4c | NO |
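Each mismatch is mechanically checkable. A sketch, assuming the tool hashes the UTF-8 bytes of the URL string:

```python
import hashlib

def expected_external_id(url: str) -> str:
    # WebExcerptTool's scheme as described above; UTF-8 encoding of the
    # URL string is an assumption on my part.
    return hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]

def hash_matches(url: str, graph_external_id: str) -> bool:
    return expected_external_id(url) == graph_external_id
```

This is the whole point of deterministic identifiers: anyone with the URL can recompute them, and a model typing hex from imagination fails every time.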

Nine nodes, nine fabricated hashes. And on top of that, when greenroom.fr returned {"found": false, "error": "Connection failed"} (DNS dead), the coordinator kept the URL anyway, invented an external_id, and wrote an excerpt paraphrasing the user prompt. A whole node manufactured from a failed tool call.

Phoenix is what made this visible

I want to dwell on this, because I think it’s the most generalizable part of the story.

I spotted the problem straight away reading the JSON — “résistance culturelle” didn’t sound like the source. But spotting it is not the same as proving it. The proof lived one layer below: in the actual tool observations the specialists returned to the coordinator before assembly. Without Phoenix, I could see something was off, but I couldn’t run the forensics — diffing every final excerpt against the real tool output, node by node. The bug would have shipped anyway, because I wouldn’t have known how deep it went.

The setup is small, but it has a couple of gotchas worth flagging. I use arize-phoenix-otel + openinference-instrumentation-smolagents:

import os

from openinference.instrumentation.smolagents import SmolagentsInstrumentor
from phoenix.otel import register

session_stamp_processor = _SessionStampProcessor()  # class defined below

tracer_provider = register(
    project_name="topics-matcher-agentic",
    endpoint=os.getenv("PHOENIX_COLLECTOR_ENDPOINT", "http://localhost:4317"),
)
tracer_provider.add_span_processor(session_stamp_processor, replace_default_processor=False)
SmolagentsInstrumentor().instrument(tracer_provider=tracer_provider)

Two things to know. First, replace_default_processor=False is mandatory — the default in phoenix.otel is True, which silently removes Phoenix’s exporter when you add your own processor. Lose an afternoon to that one if you want. Second, smolagents creates a fresh trace_id per managed-agent call, so without intervention every specialist invocation lands in a different Phoenix trace and you can’t see a run end-to-end. The fix is a custom SpanProcessor that stamps a per-run session.id on every span:

from opentelemetry.sdk.trace import SpanProcessor

class _SessionStampProcessor(SpanProcessor):
    def __init__(self):
        self._session_id = None

    def set_session_id(self, session_id: str):
        self._session_id = session_id

    def on_start(self, span, parent_context=None):
        if self._session_id and span.is_recording():
            span.set_attribute("session.id", self._session_id)

    def on_end(self, span): pass
    def shutdown(self): pass
    def force_flush(self, timeout_millis=None): return True

Set the session ID once per run (session_stamp_processor.set_session_id(f"agentic-{int(time.time())}")) and the whole run — coordinator + every specialist call + every tool observation — collapses into one Phoenix session you can scroll through.

Monitoring is not optional for multi-agent systems. The coordinator is a closed black box if you only see its output.

Reconciliation: the idea

Once I could see the lie, I needed a way to fix it deterministically. Asking the model not to lie wasn’t going to cut it — temperature=0.7 and a 30-step reasoning budget gave it plenty of room to keep being creative. The right shape was post-processing, not better prompting.

I called the pass reconcile. It walks every tool observation from the run, builds a multi-key index, and matches each graph node against it. The index shape:

def _empty_index() -> dict:
    return {
        "by_url": {},            # web + Wikipedia nodes
        "by_headline_id": {},    # AFK headlines
        "by_video_id": {},       # YouTube / AFK videos
        "by_external_id": {},    # (source, id) pairs: Open Library, Deezer, …
        "by_wiki_title": {},     # Wikipedia title fallback
    }

Each tool gets a small indexer function that knows how to extract its identifying fields, registered in a dispatch map:

_TOOL_INDEXERS = {
    "web_excerpt":              lambda obs, idx: _index_web(obs, idx["by_url"]),
    "wikipedia_excerpt":        lambda obs, idx: _index_wikipedia(obs, idx),
    "search_afk_headlines":     lambda obs, idx: _index_headlines(obs, idx["by_headline_id"]),
    "search_youtube":           lambda obs, idx: _index_youtube(obs, idx["by_video_id"]),
    "search_books_openlibrary": lambda obs, idx: _index_generic(obs, idx["by_external_id"], "openlibrary", "olid"),
    "search_deezer_artists":    lambda obs, idx: _index_generic(obs, idx["by_external_id"], "deezer", "deezer_id", is_list=False),
    # … one row per tool
}

The reconciliation step itself walks the graph nodes, looks each one up by its identifying field, and does one of three things: keep-and-overwrite, keep-as-editorial, or drop:

for i, node in enumerate(graph.nodes):
    # Editorial nodes (root/central) are preserved even without a tool match
    if node.node_type in ("root", "central"):
        reconciled_nodes.append(node)
        stats["nodes_kept_editorial"] += 1
        continue

    result = _reconcile_node(node, obs_index)
    if result is not None:
        reconciled_nodes.append(result)
        stats["nodes_matched"] += 1
    else:
        stats["nodes_dropped"] += 1
        logger.warning(
            f"Reconcile: dropped node idx={i} "
            f"(source={node.source}, title={node.title!r}) "
            f"-- no matching tool observation"
        )

The contract is clear: technical fields (external_id, excerpt, url) get overwritten from the tool side, no questions asked. Editorial fields (title, llm_rationale, ordering) are preserved — that’s the value the coordinator legitimately adds. Any non-editorial node not backed by a real tool observation is dropped.

by_wiki_title is in the index because the coordinator likes shortening article titles (“Réseau Natura 2000” becomes “Natura 2000”). When URL match fails on a Wikipedia node, we fall back to title match — exact first, then substring. There’s a similar fallback for web nodes (the coordinator sometimes truncates https://www.eea.europa.eu/.../state-of-europes-biodiversity to just https://www.eea.europa.eu/, which we can recover by domain match) and for Open Library identifiers (the coordinator strips the /works/ prefix). Each of these fallbacks is a small specific patch for an observed lie pattern, not a general “fuzzy match anything” rule.

So far so good. Then I tried to actually populate the index.

Attempt 1: _PersistentMemoryAgent — fighting the lifecycle

My first instinct was to walk the specialist memory after the coordinator finished:

for agent_name, specialist in coordinator.managed_agents.items():
    for step in specialist.memory.steps:
        ...  # extract tool_calls + observations

The index came out almost empty.

It took digging into the smolagents source to understand why. ToolCallingAgent.__call__ (the path the coordinator hits when it delegates) ends up in run(), and run() opens with this:

def run(self, task, stream=False, reset=True, ...):
    if reset:
        self.memory.reset()

Every time the coordinator delegates to a specialist, the specialist’s memory is wiped. After a full run, each specialist only remembers its last call. The coordinator delegated to knowledge_specialist seven times during the free-party run (Wikipedia, Brave search, web excerpts, books, …); only the books call survived in memory.
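A toy version of that lifecycle makes the failure mode obvious (illustration only, not smolagents code):

```python
class Memory:
    def __init__(self):
        self.steps = []

    def reset(self):
        self.steps.clear()

class Specialist:
    """Toy model of a managed agent whose memory resets on every run()."""
    def __init__(self):
        self.memory = Memory()

    def run(self, task: str, reset: bool = True):
        if reset:
            self.memory.reset()
        self.memory.steps.append(task)

s = Specialist()
for task in ["wikipedia", "brave_search", "books"]:
    s.run(task)
assert s.memory.steps == ["books"]  # only the last delegation survives
```

Walk the memory after three delegations and you find one step: exactly the near-empty index I was staring at.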

My first fix was a 4-line subclass forcing reset=False:

class _PersistentMemoryAgent(ToolCallingAgent):
    def __call__(self, task: str, **kwargs):
        kwargs.setdefault("reset", False)
        return super().__call__(task, **kwargs)

It worked correctly. The observation index jumped from near-empty to 104 entries, the reconciliation kept 11 of 17 nodes (with three small URL-mismatch fixes that became fallback matchers), and the output was honest.

The token bill was unshippable. Each specialist now re-ingested its growing memory at every coordinator call: system prompt + tool descriptions + every previous step + every previous observation. Across the run, that pulled the dossier from 1.31M tokens (broken reconcile baseline) to 2.09M (+59%). A separate phase of work I’d just landed (trimming default tool result limits, dropping unused fields) had clawed back ~600K tokens — reset=False ate every one of them.

Working but unshippable.

Attempt 2: step_callbacks accumulator — using the side channel

The reconciliation only needs (tool_name, observation_json) pairs. It doesn’t need system prompts, model outputs, or step ordering. So the right move was to capture exactly that, outside the agent’s memory lifecycle entirely.

smolagents exposes step_callbacks on MultiStepAgent.__init__. After every ActionStep, the framework calls your callback with the full step object — including .tool_calls and .observations. The callback can write to whatever external state you want; it’s not bound by memory.reset().

The whole accumulator is ~60 lines:

class ObservationAccumulator:
    """One shared instance, passed to every specialist via step_callbacks.
    Lives outside agent memory, so it survives memory.reset()."""

    def __init__(self) -> None:
        # list of (agent_name, tool_calls, parsed_observation_objects)
        self._steps: list[tuple[str, list, list[dict]]] = []

    def make_callback(self, agent_name: str):
        def _callback(step) -> None:
            tool_calls = getattr(step, "tool_calls", None)
            observations_str = getattr(step, "observations", None)
            if not tool_calls or not observations_str:
                return
            parsed = _parse_json_objects(observations_str)
            if not parsed:
                return
            # Copy tool_calls — the step object may be mutated by later
            # smolagents lifecycle hooks.
            self._steps.append((agent_name, list(tool_calls), parsed))
        return _callback

    def build_index(self) -> dict:
        """Fold accumulated observations into the standard reconcile index."""
        index = _empty_index()
        for agent_name, tool_calls, parsed_objects in self._steps:
            _pair_tools_to_obs(tool_calls, parsed_objects, index, context=agent_name)
        return index

Wiring it in is one line per specialist:

accumulator = ObservationAccumulator()

knowledge_specialist = ToolCallingAgent(
    tools=[WikipediaExcerptTool(), WebExcerptTool(), BraveWebSearchTool(), ...],
    model=specialist_model,
    name="knowledge_specialist",
    description="Encyclopedic context, books, web excerpts.",
    max_steps=4,
    step_callbacks={ActionStep: accumulator.make_callback("knowledge_specialist")},
)
# ... same pattern for the other specialists

coordinator = CodeAgent(
    tools=[],
    model=coordinator_model,
    managed_agents=[knowledge_specialist, ...],
    instructions=system_instructions,
    additional_authorized_imports=["json"],
    max_steps=30,
)

result = coordinator.run(rendered_prompt)
graph = parse_coordinator_output(result, prompt)
graph = reconcile_tool_observations(coordinator, graph, accumulator=accumulator)

One shared accumulator instance, distinct closure per specialist (tagged with its name, useful when chasing down which specialist produced which observation), and they all write into the same list. Specialists go back to the smolagents default reset=True — small context, fast — and the accumulator survives across calls because it’s not owned by any agent.

The lesson, in one line: when a library’s default lifecycle doesn’t let you observe what you need, don’t subclass to fight it. Look for a side channel. smolagents had one. I just didn’t notice on the first pass.

The numbers

Phoenix session totals, identical prompt, four runs:

| Run | Date | Prompt tokens | Completion tokens | Total | Δ vs v1 |
|---|---|---|---|---|---|
| v1 — broken reconcile (Phase 4 only) | 2026-04-17 | 1,236,669 | 74,581 | 1,311,250 | baseline |
| v2 — _PersistentMemoryAgent | 2026-04-17 | 2,023,000 | 64,156 | 2,087,156 | +59 % |
| v3 — v2 + URL/title fallbacks | 2026-04-17 | 1,858,934 | 72,630 | 1,931,564 | +47 % |
| 2b — step_callbacks | 2026-04-19 | 1,354,400 | 73,095 | 1,427,495 | +9 % |

2b sits 9% above the broken-reconcile floor — that’s the genuine cost of producing a correct dossier (callbacks fired, reconciliation run, full index built). Versus the working-but-bloated v3, the accumulator saves 504K tokens per dossier (−26%) while keeping every correctness gain.

I then ran the full pipeline against an unrelated dossier — a comparison between Dark Enlightenment (Thiel, Yarvin, Land) and Lumières vertes (Bruno Latour, Frédéric Keck) — as an end-to-end check. Result: 15 of 16 nodes persisted, the one drop was a real catch (a hallucinated radiofrance URL), and the average 6-gram overlap between tool observations and final excerpts came out at 100% across 11 nodes (vs the 0–68% spread on the free-party baseline). The failure mode that started all this — the coordinator silently rewriting quotes — is no longer a regression target.

Three open questions

This is where the post stops pretending to know the answer.

Is the coordinator agent overkill? Most of my prompts have a beginning, a middle, an end, an explicit chronology, sometimes the URLs spelled out. They look like specs more than open questions. A scripted DAG could probably handle 70% of what the coordinator does today — and a scripted DAG can’t hallucinate quotes because it never types them. But the coordinator does genuinely surface things I didn’t ask for: a podcast that closes the dossier on the right note, a YouTube documentary I’d forgotten existed, a Brave-search-discovered URL that ends up being the best source on the page. I’d lose that creativity. I’m not yet sure the trade is worth it.

Could LangGraph do better? Probably some things, yes — explicit state machines, better trace primitives, proper cycle handling. But it’s a real investment with a real learning curve, and the smolagents cost was zero. I built the whole feature on top of it in a couple of weeks. I’m not going to rewrite that on speculation. If the reconciliation pattern starts breaking down — multiple coordinators, true graph topologies, durable workflows — LangGraph becomes interesting. Today it doesn’t.

Are we still burning too many tokens? ~1.43M per dossier is not nothing. The compact JSON contract on specialists (a separate phase) saved ~10K. Tool-output trimming saved ~600K. The accumulator recovered most of what reset=False cost. The next obvious lever is the coordinator prompt itself (~5K tokens of system prompt that re-ingests on every step), but I don’t yet have a clean idea of what to cut. A scripted middle layer that pre-fetches deterministic sources before the coordinator even starts might be the right move. I don’t know yet.