Blameless post-mortems: what I owe to John Allspaw and the Etsy engineering team
by Sylvain Artois on Apr 2, 2026
- #Management
Wednesday, April 1st 2026, 8:35 AM, Meaux train station. I had scheduled a LinkedIn post at 8:30. I need to paste a link in the comments. I look for the URL on my tech blog.
Unreachable.
The incident
8:37 AM. The connection is unreliable at the station, so I do what everyone does: I type “test” into DuckDuckGo. Results show up. The internet works. My site does not.
8:38 AM. I check my other services. afk.live, my news aggregator — down. My literary journal sylvain.artois.me — down. My Bluesky instance — down. Personal Global Outage… Everything I host is down. I am not Etsy, I get ten visits a day on my tech blog and about a hundred on afk.live. The world will keep turning. But I have just launched kairos.ing, my consulting company. This is not a good time.
8:39 AM. I contact my partner. At the heart of my infrastructure, sitting on the living room mantelpiece, there is an old self-hosted Linux machine, deliberately cut off from the internet. Is it up? We both work from home, and I refuse to open remote access on this machine — a matter of security.
9:07 AM. Answer: the home server is running. The problem comes from my Scaleway instance, the one serving the pre-rendered HTML pages of my websites. And now I know. I know exactly what happened, because I have already experienced this failure. The week before. My VPS rebooted overnight — probably a maintenance operation from the hosting provider — and my Docker stack restarted without the --env-file argument. Without its environment variables, the web server cannot function.
9:20 AM. I arrive in Paris. I log into the Scaleway back office to register my Mac's SSH key on the instance. Thirty minutes wasted. If the key had already been there, I could have restarted everything from the train.
9:30 AM. Everything is up. I post my comment on LinkedIn. Total downtime: about two hours. Actual resolution time: ten minutes.
The questions
So many questions. I will keep two.
Why did I have no monitoring probes on my services? A simple alert would have notified me at 7:30 AM, coffee in hand, at my desk. I would have resolved the incident before it even had consequences.
Why was Docker not configured to restart automatically with the right configuration? I had already encountered this failure. I had fixed it by hand. I had not fixed the system.
The actions
Two things to do today, not tomorrow:
- Set up monitoring probes on both servers. And be clever about it: if Scaleway goes down, the self-hosted server will not be able to report its own status through the usual channels. A cross-monitoring mechanism will be needed.
- Configure Docker so that it always restarts with the right environment file. No more forgotten --env-file.
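As an illustration of the first action, here is a minimal sketch of a probe that each server could run against the other one's public endpoints. It is not the setup actually deployed: the names ProbeResult, http_probe, run_checks and format_alert are hypothetical, the URLs to watch are left to the caller, and the alert is only formatted as text, on the assumption that it would then be sent through a channel that does not depend on the machine being watched.

```python
import http.client
import urllib.request
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class ProbeResult:
    url: str
    ok: bool
    detail: str


def http_probe(url: str, timeout: float = 5.0) -> ProbeResult:
    """Return ok=True only if the URL answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
            return ProbeResult(url, 200 <= status < 300, f"HTTP {status}")
    except (OSError, http.client.HTTPException) as exc:
        # Covers DNS failures, refused connections, timeouts and HTTP
        # error statuses (urllib raises HTTPError, an OSError subclass).
        return ProbeResult(url, False, str(exc))


def run_checks(
    urls: Iterable[str],
    probe: Callable[[str], ProbeResult] = http_probe,
) -> List[ProbeResult]:
    """Probe every URL and return only the failures."""
    results = [probe(url) for url in urls]
    return [result for result in results if not result.ok]


def format_alert(failures: List[ProbeResult]) -> str:
    """Build a plain-text alert body listing each unreachable service."""
    lines = [f"{len(failures)} service(s) unreachable:"]
    lines += [f"- {failure.url} ({failure.detail})" for failure in failures]
    return "\n".join(lines)
```

Run from a scheduler on both machines, each one pointing at the other's services, this gives the cross-monitoring described above: when one side goes silent, the other side is the one that notices. The probe argument of run_checks exists so that the decision logic can be exercised without touching the network.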
A careful reader will notice that this little story follows a precise pattern. A factual timeline, without judgement. Open questions, directed at the system rather than at the person. Concrete, immediate actions. No one to blame. No “I should have”. No shame.
I did not invent this pattern. I learned it.
Paris, 2014–2017
I joined Etsy France in 2014. For those who may not know, Etsy is an American marketplace dedicated to handmade and creative goods. But behind the warm storefront of handmade jewellery and leather notebooks, there was an engineering team that, at the time, was quietly redefining the way our industry thinks about incidents, errors and responsibility.
Leading this quiet revolution: John Allspaw, Etsy’s CTO. Allspaw is not a manager like the others. He is a researcher disguised as an engineer, a man who cites Sidney Dekker[1] and Erik Hollnagel[2] in his articles on Etsy’s engineering blog, and who held the conviction, rare in our line of work, that failures are never the fault of a single person.
In 2016, Allspaw came to visit us in Paris. I remember that day with particular clarity. There are professional encounters you forget, and there are those that shift something in the way you think, permanently.
He talked to us about the blameless post-mortem.
What “blameless” really means
The term is often misunderstood. “Blameless” does not mean that everyone gets away without being held accountable. It is not amnesty. It is not indifference.
It is a pact. A pact that says: if you tell me exactly what you did, what you observed, what you expected would happen, and why your action seemed reasonable at the time — then I commit to not punishing you for it. Because punishing you would cost me far more than your mistake.
Allspaw called this a Just Culture. The idea is clear: an engineer who fears punishment will never give the details needed to truly understand a failure. They will minimise, they will smooth things over, they will do Cover-Your-Ass engineering. And without those details, we do not understand what happened. And what we do not understand, we relive.
Allspaw described the toxic cycle with surgical precision: an engineer makes a mistake, they get punished, trust erodes between the field and management, engineers go silent, management loses visibility on the reality of daily work, latent conditions for failure build up in the shadows, and the next outage arrives. More violent. Less understood.
The “Second Story”
What struck me most in Allspaw’s approach is the concept of the Second Story. When an incident occurs, there is always a first story, the one that jumps out: “the engineer was not paying attention”, “they did not follow the procedure”, “they deployed without checking”. A first story is reassuring. It gives you someone to blame, a simple cause, and the illusion that telling people to be more careful will make the problem disappear.
The second story is richer, more uncomfortable, and infinitely more useful. It asks: why did this engineer’s action seem reasonable to them at the time they took it? What was their context? What information did they have? What information were they missing? What did the system — the tooling, the interfaces, the culture, the pressure — make possible, or even likely?
Allspaw cited Erik Hollnagel: “Accidents don’t happen because people gamble and lose. They happen because the person does not believe that the accident that is about to occur is at all possible, or because the person believes it has no connection to what they are doing, or because the desired outcome is considered worth the risk.”
In the debriefing facilitation guide that Etsy published as open source in 2016[3], there is an example that has never left me. An engineer, Phil, deploys code and causes an outage. In the post-mortem meeting, he lowers his head: “I just wasn’t paying attention, I guess. This is on me.” The team is relieved. We have our culprit. We can fill in the form and move on.
But the facilitator does not let go. He asks the engineer to go back, to describe his actions, to show his screen. And there, we discover that a recently redesigned dashboard displays the number 8 in an italic font with slashed zeroes — and that this 8 looks remarkably like a 0. Half the room had read zero too. The problem was not one man’s inattention. It was a typographic choice in an interface.
Without the right questions, we would never have found it. We would have told Phil to “pay more attention next time”. And someone else, a month later, would have read the same zero.
The facilitator
What I also learned at Etsy is that the blameless post-mortem does not work on its own. You need a facilitator. Someone whose role is not to judge, not even to solve, but to make people talk. To gather multiple perspectives. To resist the temptation of the single cause. To ask the questions nobody asks because they seem naive, or because the answer is frightening.
Todd Conklin, often cited by Allspaw’s team, puts it better than anyone: “The skill is not in knowing the right answer. Right answers are pretty easy. The skill is in asking the right question. The question is everything.”
What I owe them
I could tell the story of my April 1st incident as a simple story: I forgot to configure Docker correctly, I was careless, I lost two hours, it is my fault. First story. It fits in one line. And it teaches me nothing.
The second story is more interesting. Why did I have no monitoring? Because I work alone, because my infrastructure grew in successive layers, and because monitoring is the kind of thing you postpone when you have no on-call rotation and no colleague to remind you it matters. Why did Docker not have the right restart configuration? Because I had fixed the symptom the first time without addressing the cause, pressed by something else — a backlog, a daily urgency. The patch had held for ten days.
I am alone on this infrastructure, I have no facilitator, no meeting room, no team to gather. And yet, the reflex is there. Factual timeline. Open questions about the system. Concrete actions. No one to blame — even when the one to blame is me.
This reflex, I owe it to the three years spent with the engineering team at Etsy France, between 2014 and 2017. I owe it to John Allspaw, of course, but also to all the technical leaders who carried this culture every day, in stand-ups, in code reviews, in architecture reviews, in hallway conversations. People who had understood that the safety of a system is not built on fear, but on trust.
Organisations, groups of human beings, look for someone to blame. Scapegoats[4]. I have witnessed this throughout my career. “It’s the fault of the dev who just resigned”: the absent are always wrong. Worse: “It’s the fault of the intern, a bit of an odd one, he didn’t follow the procedure.” The most vulnerable pay for everyone.
Sometimes you are in very caring environments, and people still cannot bring themselves to speak openly. For fear of hurting feelings? But that is precisely what “blameless” is about: it is a framework so that caring teams are not afraid to dig into complex incidents and trace back to root causes — always complex, always multi-layered.
Sometimes you are in a startup fighting for survival, and you apply a patch so you can get back to the backlog and the daily emergencies. The patch will hold for ten days.
We owe it to ourselves to run blameless post-mortems. Not because it is fashionable, or because a blog post invited us to. But because it is the only honest way to make our organisations progress. To turn every outage into a lesson rather than a trial. To look at our systems — technical and human — with enough courage to see what does not work, without looking away toward the first convenient culprit.
This is what I learned at Etsy. This is what I will not forget.
Footnotes
1. Sidney Dekker is a Dutch-born researcher in human factors and the safety of complex systems. His book The Field Guide to Understanding ‘Human Error’ (2014) has profoundly influenced the way the software industry thinks about incidents. He did much to popularise the distinction between first story and second story, along with the critique of what he calls the “Bad Apple Theory”: the idea that eliminating the bad elements would be enough to eliminate errors.
2. Erik Hollnagel is a Danish researcher, a pioneer of resilience engineering. Where the traditional approach to safety consists of cataloguing what goes wrong, Hollnagel proposes studying what goes right: how humans adapt, compensate, and keep systems running despite their complexity. Allspaw cites him extensively in his foundational article, Blameless PostMortems and a Just Culture (2012): “Accidents don’t happen because people gamble and lose.”
3. Etsy’s Debriefing Facilitation Guide for Blameless Postmortems (2016). Allspaw details the concrete practice of the post-mortem: the role of the facilitator, the art of asking the right questions, and why a debriefing should first and foremost be an opportunity to learn, not to correct. The full guide is available as open source on GitHub.
4. See René Girard, Violence and the Sacred (1972). Scapegoating is a fundamental reflex of human groups: singling out a victim to avoid questioning collective structures. Girard reads the Gospels as the first narrative that takes the side of the victim rather than that of the crowd. It is precisely this compassion toward the victim that underpins the blameless approach: refusing to sacrifice an individual to preserve the group’s comfort, and having the courage to examine the system rather than look for someone to blame.