Engagement surveys are theatre

I’m the CTO of a company building an alternative to engagement surveys. So take everything below with that in mind. I’ll come back to it at the end.

Most engagement surveys are theatre. Someone runs one every quarter, leadership gets a number, the number is mostly fine, the slide deck circulates, and very little actually changes. Culture Amp, Lattice, 15Five, the homegrown Google Form: it doesn’t really matter which. The newer AI-summary version, where an LLM reads the free-text answers and writes a paragraph of themes for you, doesn’t fix anything either; it just runs the same broken instrument with fewer people.

What bugs me about it is that we used to do this kind of research well, and then somewhere along the way we stopped, and almost nobody seems to remember why.

We used to do qualitative

For most of the twentieth century, if you wanted to understand a group of people, you talked to them. Anthropologists did this for a living. Malinowski sat in the Trobriand Islands for years writing things down. Margaret Mead lived in Samoa. The output was what Clifford Geertz later called thick description: specific, situated, hard to summarise in a number, full of things the researcher hadn’t gone in looking for.

It got formal eventually. Glaser and Strauss published grounded theory in 1967. You interview people, transcribe, code the transcripts by hand, and let themes emerge from the data instead of from the hypothesis. Braun and Clarke set out a closely related recipe in 2006 under the name thematic analysis. The core steps are much the same. Sample sizes were tiny by today’s standards, ten to thirty people, but the depth was real.

This is also how we used to study organisations. Hawthorne, industrial ethnography, Studs Terkel’s Working. You couldn’t roll it up into a dashboard but you actually knew what was going on inside the place.

Scale broke it

You cannot interview thirty thousand employees by hand. By the 1930s organisational psychology was already drifting towards instruments that scaled. Rensis Likert published his scale in 1932 to turn attitudes into numbers. Seventy years later Fred Reichheld gave the world NPS in a Harvard Business Review piece, and the modern engagement-survey industry was born off the back of it.

The trade was obvious at the time even if nobody wrote it down. Quantitative gave up the why and kept the how many. The instrument can only detect what you ask. The aggregate becomes the only thing you can look at, because the aggregate is the only thing that scales.

For seventy years there was no third option. Either you ran the survey and learned a thin thing about everyone, or you ran interviews and learned a deep thing about almost nobody. Companies above a certain size ran the survey because they had to. The theatre was the cost of doing business.

What the survey actually costs

The trade-off shows up as four specific failures, all coming from the same underlying decision.

The questions are written in advance, which means you only find what you went in looking for. If the real problem in your org this quarter is that your best PM is being slowly squeezed out by a peer, and you didn’t think to put a question about that on the form, the survey is not going to surface it for you.

The unit of analysis is the aggregate, not the individual. A team of eight where two people are quietly checked out averages out: the score is 7.8, leadership reads “fine” and moves on. Then the two people leave a quarter later, the score drops to 6.2, and now there’s a retention conversation happening three months too late.
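To make the averaging problem concrete, here is a small worked example. The 7.8 is the figure from the paragraph above; the individual scores and the at-risk threshold are invented for illustration, and plenty of other distributions would give the same average.

```python
from statistics import mean

# Hypothetical scores on a 0-10 scale for a team of eight:
# six engaged people and two who have quietly checked out.
scores = [9, 9, 9, 9, 9, 9, 4.2, 4.2]

# What the dashboard shows.
team_average = round(mean(scores), 1)

# What you see if you look at individuals instead of the aggregate.
AT_RISK_THRESHOLD = 5  # assumed cut-off, purely illustrative
at_risk = [s for s in scores if s < AT_RISK_THRESHOLD]

print(team_average)  # 7.8
print(len(at_risk))  # 2
```

The average reads as healthy, and the two people who matter most this quarter are invisible in it.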

Anonymity protects the answer but blocks the follow-up. Someone tells you something real and there’s no way to ask “what would you change?” without breaking the very protection that made the honest answer possible. The most useful signals come back with no thread you can pull on.

The cadence is wrong on top of all of that. Quarterly is too slow. The thing the person was reacting to when they typed the answer has already faded by the time the analysis lands. The team reorged, the manager changed, the project shipped, and you’re effectively reading ghosts.

Klarna had green dashboards too

In 2024 Klarna announced that its chatbot was doing the work of 700 customer-service agents. The internal metrics didn’t flag anything until the quality dropped and they had to walk it back. Sebastian Siemiatkowski said it in plain words: they focused too much on efficiency and cost, the quality dropped, and it wasn’t sustainable.

I think about this often, because the same shape exists inside almost every company when it comes to engagement. The dashboards read green, the eNPS is +20, then a senior engineer leaves, and her exit interview surfaces a thing five other people would have told you six months earlier, if you had known to ask the right question, of the right person, at the right time.

Why LLMs change the maths

The reason organisations stopped doing qualitative at scale wasn’t that qualitative was worse. It was that running a real interview, transcribing it, coding it, and analysing it across thousands of people was prohibitively expensive in human-hours. LLMs collapse that cost. Not entirely. Enough.

A trained interviewer doesn’t ask the next question from a list. They follow what the person just said, ask the follow-up that makes sense, and only loop back to the fixed dimensions when there’s room. An LLM can do this. The model holds the state of the conversation, knows what the org is trying to learn this round, and picks the next question based on the actual answer instead of a script. The pre-defined dimensions become a fallback, not the spine.
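The control flow described above can be sketched in a few lines. This is a minimal illustration under assumptions, not a description of any particular product: `run_interview`, `propose_follow_up` and `stub_follow_up` are hypothetical names, and the stub stands in for what would be an LLM call in a real system.

```python
from typing import Callable, List, Optional, Tuple

Turn = Tuple[str, str]  # (question, answer)


def run_interview(
    opening_question: str,
    fallback_questions: List[str],
    get_answer: Callable[[str], str],
    propose_follow_up: Callable[[List[Turn]], Optional[str]],
    max_turns: int = 6,
) -> List[Turn]:
    """Prefer adaptive follow-ups; use the fixed dimensions only as a fallback."""
    transcript: List[Turn] = []
    remaining = list(fallback_questions)
    question: Optional[str] = opening_question
    while question is not None and len(transcript) < max_turns:
        transcript.append((question, get_answer(question)))
        # The follow-up proposer sees the whole conversation so far.
        question = propose_follow_up(transcript)
        if question is None and remaining:
            # Nothing worth following up on: loop back to a fixed dimension.
            question = remaining.pop(0)
    return transcript


def stub_follow_up(transcript: List[Turn]) -> Optional[str]:
    # Stand-in for an LLM call: follow up once if the last answer mentions a manager.
    question, answer = transcript[-1]
    if "manager" in answer.lower() and "manager" not in question.lower():
        return "What happened with your manager?"
    return None


# Canned answers so the sketch runs without a person or a model.
answers = {
    "How has work been lately?": "Fine, mostly, apart from my manager.",
    "What happened with your manager?": "Priorities change every week.",
    "How is your workload?": "Heavy but okay.",
}
transcript = run_interview(
    "How has work been lately?",
    ["How is your workload?"],
    lambda q: answers[q],
    stub_follow_up,
)
for question, answer in transcript:
    print(f"Q: {question}\nA: {answer}")
```

The fixed question only gets asked once the follow-up proposer has nothing left to chase, which is the inversion the paragraph describes.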

Once you have thousands of transcripts you can also stop asking the question bank to do the analysis. Embed the utterances, cluster the embeddings, then read representative samples from each cluster and write a clean theme label. That’s automated grounded theory, more or less. The themes you find are the themes that are actually in the data, not the themes you guessed at when you wrote the survey.
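Here is a toy version of the embed, cluster, read-a-representative loop. Two loud assumptions: the bag-of-words `embed` function is a stand-in for a real embedding model, and the greedy threshold clustering is a stand-in for whatever clustering algorithm you would actually use. The utterances and the 0.5 threshold are made up.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    # Stand-in for a real embedding model: a bag-of-words count vector.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster(utterances: List[str], threshold: float = 0.5) -> List[List[int]]:
    """Greedy single pass: join the first cluster whose seed is similar enough."""
    vectors = [embed(u) for u in utterances]
    clusters: List[List[int]] = []
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])  # no cluster was close enough, start a new one
    return clusters


def representative(members: List[int], utterances: List[str]) -> int:
    """Pick the member most similar, in total, to its own cluster (a medoid)."""
    vectors = {i: embed(utterances[i]) for i in members}
    return max(
        members,
        key=lambda i: sum(cosine(vectors[i], vectors[j]) for j in members),
    )


utterances = [
    "the roadmap changes every week",
    "our roadmap changes every sprint",
    "nobody explains promotion decisions",
    "nobody explains promotion criteria",
    "the roadmap changes every single week",
]
clusters = cluster(utterances)
for members in clusters:
    # A human (or a model, with the quotes attached) writes the label after reading this.
    print(len(members), "utterances, e.g.:", utterances[representative(members, utterances)])
```

The important property is the direction of travel: the groups come out of the utterances first, and the label is written afterwards by someone who has read them.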

The most interesting technical part is the follow-up. You want to ask “what happened next?” without revealing who answered to anyone downstream. The way around it is to split the system. The interview layer holds identity. The analysis layer never sees it. The agent can ask twelve adaptive follow-ups in a single session, then identity is stripped before any human or aggregation step touches the transcript. You get the follow-up. You don’t get a paper trail back to the person.
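In code, the split might look something like the sketch below. The type names and `hand_off` are hypothetical, and a real system would enforce the boundary between services rather than between two dataclasses in one process.

```python
import uuid
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class InterviewSession:
    employee_id: str   # exists only inside the interview layer
    turns: List[str]   # questions and answers, adaptive follow-ups included


@dataclass(frozen=True)
class AnalysisRecord:
    record_id: str     # random, deliberately not derived from the employee
    turns: List[str]


def hand_off(session: InterviewSession) -> AnalysisRecord:
    """Strip identity once the session, follow-ups included, is finished."""
    return AnalysisRecord(record_id=uuid.uuid4().hex, turns=list(session.turns))


session = InterviewSession(
    employee_id="emp-042",  # made-up identifier
    turns=[
        "Q: How has work been lately?",
        "A: Fine, mostly.",
        "Q: What would you change?",
        "A: Fewer priority changes mid-sprint.",
    ],
)
record = hand_off(session)
print(record.record_id, len(record.turns))
```

This only removes the explicit identifier. It does nothing about identifying details in the text itself, which is a separate problem.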

The last piece is cadence. Once the per-interview cost is near zero, quarterly stops being the obvious frequency. You can run a five-minute conversation every couple of weeks, with the question shaped by what’s actually happening in the org that week. The context isn’t a ghost anymore.

Where this actually gets hard

None of this comes for free, and I don’t want to pretend it does. The pipeline matters a lot more than the model, and the pipeline is where most of the products in this space are still pretty weak.

Models will hallucinate themes if you let them. A badly built pipeline will confidently produce a theme that isn’t actually in the data, just because there’s a coherent story it can tell. The fix is unglamorous: every theme has to point back to specific utterances, and the utterances have to ship alongside the theme so a human can sanity-check. If you can’t ground a theme in evidence, it doesn’t go in the report. That’s a discipline you build into the workflow, not a feature you buy.
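A minimal sketch of that rule as a gate in the pipeline, assuming each candidate theme arrives with a list of utterance IDs it claims as evidence. `Theme`, `grounded_themes` and the `min_evidence` default are illustrative names and values, not an existing API.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Theme:
    label: str
    evidence_ids: List[str]  # utterance IDs the theme claims as support


def grounded_themes(
    themes: List[Theme],
    utterances: Dict[str, str],
    min_evidence: int = 2,  # assumed threshold, tune to taste
) -> List[dict]:
    """Keep only themes whose evidence resolves to real utterances."""
    report = []
    for theme in themes:
        # IDs that do not exist in the transcript store count as no evidence.
        quotes = [utterances[i] for i in theme.evidence_ids if i in utterances]
        if len(quotes) >= min_evidence:
            # The quotes ship with the label so a human can sanity-check it.
            report.append({"label": theme.label, "quotes": quotes})
    return report


utterances = {
    "u1": "The roadmap changes every week.",
    "u2": "We replan mid-sprint all the time.",
}
candidates = [
    Theme("Unstable priorities", ["u1", "u2"]),
    Theme("Concerns about pay", ["u9"]),  # cites an utterance that does not exist
]
report = grounded_themes(candidates, utterances)
print([t["label"] for t in report])
```

The second theme tells a plausible story and still gets dropped, because nothing in the data backs it.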

Embeddings cluster on surface similarity, which isn’t the same as conceptual similarity. “I love my manager” and “I hate my manager” land closer in embedding space than you’d want, because most of the tokens are the same. You end up needing a small classifier on top of the clusters, and you need human spot checks on the cluster boundaries until you trust the pipeline.
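As a rough illustration of the second pass, the sketch below splits one cluster by polarity. The word lists are a toy stand-in for the small trained classifier described above, and anything the toy rule can’t decide is routed to “unclear” for a human spot check.

```python
from typing import Dict, List

# Toy lexicons standing in for a trained classifier; purely illustrative.
POSITIVE = {"love", "great", "supportive"}
NEGATIVE = {"hate", "awful", "dismissive"}


def polarity(text: str) -> str:
    words = set(text.lower().replace(".", "").split())
    has_positive = bool(words & POSITIVE)
    has_negative = bool(words & NEGATIVE)
    if has_positive and not has_negative:
        return "positive"
    if has_negative and not has_positive:
        return "negative"
    return "unclear"  # mixed or no signal: leave it for a human


def split_by_polarity(cluster: List[str]) -> Dict[str, List[str]]:
    """Second pass over one cluster that the embedding step lumped together."""
    groups: Dict[str, List[str]] = {}
    for utterance in cluster:
        groups.setdefault(polarity(utterance), []).append(utterance)
    return groups


manager_cluster = [
    "I love my manager",
    "I hate my manager",
    "My manager is fine",
]
groups = split_by_polarity(manager_cluster)
print(groups)
```

One “manager” cluster becomes two themes and a review queue, which is a lot closer to what the data actually says.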

Anonymity is also harder than people expect. Modern models can identify writers from style with uncomfortable accuracy, so stripping names off transcripts is nowhere near enough. You need a paraphrase pass before downstream analysis, real service-level separation between the system that holds identity and the system that does the analysis, and some care at the agent layer to avoid reproducing the person’s voice too faithfully in the transcript.

There’s also the adversarial side. People will try to break the system with prompt injection or deliberately skewed answers, and you have to plan for it the way you’d plan for any agent that ingests untrusted text. It’s all standard hygiene at this point, just applied to a context where most teams haven’t thought about it yet.

The honest summary is that Sonnet, GPT, and Gemini are all good enough at the conversation now. The hard part is everything around the conversation, which is also why most “AI for engagement” products today still feel flat. Bolting an LLM-summary on top of a Likert-scale survey gives you the worst of both worlds: the original trade-off is preserved, and now there’s a hallucination risk layered on top of it.

What this actually looks like

A research instrument with the depth of a one-to-one interview and the coverage of a survey. The unit of analysis is the conversation. Aggregation happens on top of grounded data, not in place of it. Leadership gets themes with quotes and traceable evidence instead of a single number.

Questions adapt. The individual is the unit, not the aggregate. The agent does the follow-up before identity is stripped. The cadence is continuous. The four costs of the original trade get addressed in order instead of papered over with an LLM summary at the end.

That’s what we’re building at Spradley AI.

About the bias

I’d have written this post even if I hadn’t ended up working on the alternative, because the survey industry is one of the more obviously broken parts of modern work, and the genuinely interesting thing about LLMs in this space is not that they let you summarise broken surveys a bit faster. It’s that they finally make it possible to undo a seventy-year trade-off the industry has been quietly carrying since Likert. That’s worth talking about whether or not I happened to build a company around it.

So yes, I’m biased. I still think the argument stands on its own without me.
