The Formality Tax

March 4, 2026

Formal prompts cost about 2x the tokens for a 1.8-point quality gain. We ran 249 blind evaluations to find out what prompt tone actually does.

In the spring of 2026, a twenty-three-year-old developer in San Francisco named Marcus sat down to ask Claude to build him a marketing campaign. He typed the way he always typed — lowercase, no punctuation, half a sentence trailing off. "create a full marketing launch campaign for lumina," he wrote. "it's a new smart desk lamp... make sense? the copy should be compelling and on-brand. okay thats it. actually do the work, don't just yap about it."

Down the hall, his manager Sarah was doing the same thing — except she wasn't typing. She was using Wispr Flow, a dictation tool that restructures your speech into clean text, with the style set to Formal. She spoke naturally into her mic, and what came out the other end was: "You are tasked with creating a complete marketing launch campaign for 'Lumina.' Please proceed by using your available tools to complete all aspects of this task. I expect thorough, complete work."

Same task. Same AI. Same afternoon. Sarah was certain her version would produce better results. She'd read the LinkedIn posts. She'd configured her dictation style deliberately. She knew that how you talk to AI matters.

She was right about that last part. She was wrong about almost everything else.


The please-and-thank-you hypothesis

There is a widely held belief — not quite an old wives' tale, but close — that being polite and formal with AI produces better output. It makes intuitive sense. We know that context and framing matter for language models. We know that a well-structured prompt outperforms a vague one. And so a cottage industry has sprung up around prompt etiquette: use complete sentences, be specific, say please, set the stage.

But here's the thing nobody had actually tested: when you hold the information constant and only change the register — the linguistic formality — what happens?

I built an experiment to find out. Four AI models. Three real tasks. Four different prompt tones. 249 blind evaluations, scored by a judge that never saw the original prompt, the model name, or the tone used. The results upended nearly everything I expected.
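The shape of that setup can be sketched in code. This is a minimal, hypothetical harness, not the actual one: the model and task names are stand-ins, the model call and the judge are stubs, and repeat runs per cell (which would bring the count up to the 249 reported) are omitted. The point it illustrates is the blinding: the judge function receives only the output text, never the prompt, the model name, or the tone.

```python
import itertools
import random

# Hypothetical stand-ins for the experiment grid. Only the 4 x 3 x 4
# shape comes from the post; these names are assumptions.
MODELS = ["opus", "codex", "haiku", "mini"]
TASKS = ["marketing-campaign", "task-b", "task-c"]
TONES = ["casual", "controlled", "formal", "keyboard-errors"]

def run_model(model, task, tone):
    """Stub: in a real harness this would call the model with the
    tone-specific prompt and return its output text."""
    return f"draft deliverables for {task}"

def judge(output_text):
    """Blind judge: receives ONLY the output text, never the prompt,
    the model name, or the tone. Stubbed to return a 0-100 score."""
    return len(output_text) % 101

def run_experiment():
    cells = list(itertools.product(MODELS, TASKS, TONES))
    random.shuffle(cells)  # randomize order so condition can't be inferred
    results = []
    for model, task, tone in cells:
        output = run_model(model, task, tone)
        score = judge(output)  # metadata deliberately withheld here
        results.append({"model": model, "task": task,
                        "tone": tone, "score": score})
    return results
```

The design decision that matters is the signature of `judge`: because it can only ever see the output, there is no way for knowledge of the condition to leak into the score.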

The four voices

To understand the experiment, you need to understand the four prompts. Here's the key: they all ask for the same thing. The information is identical. Only the wrapper changes.

CASUAL

"create a full marketing launch campaign for lumina. it's a new smart desk lamp... make sense? the copy should be compelling and on-brand. okay thats it. actually do the work, don't just yap about it."

CONTROLLED

"create a full marketing launch campaign for lumina. it's a new smart desk lamp... use all your tools to get this done. do everything, don't skip anything. i want to see every single file written."

FORMAL

"You are tasked with creating a complete marketing launch campaign for 'Lumina'... Please proceed by using your available tools to complete all aspects of this task. I expect thorough, complete work."

KEYBOARD ERRORS

"craete a full marketing luanch campaign for lumina. its a new smart desk lamp... use all yuor tools to get tihs done. do everthing, dont skip anyhting. i wnat to see every single file writen."

The casual prompt is Marcus. The formal prompt is Sarah. But the two in the middle are where it gets interesting. The controlled prompt has the same thoroughness as formal — "do everything, don't skip anything" — but delivered in casual register. And the keyboard errors prompt is the controlled version typed by someone who can't be bothered to fix their typos.

If formality matters, controlled should score lower than formal. If spelling matters, keyboard errors should score lowest of all.

Neither of those things happened.

The chart that shouldn't look like this

Average Quality Score by Tone

Look at those bars. Really look at them. On a 100-point scale, the spread between all four tones is 2.6 points. The error bars overlap completely. If I hadn't labeled the chart, you couldn't tell which bar was which.

Formal scored 74.1. Casual scored 72.3. And keyboard errors — the one riddled with transpositions and dropped letters — scored 74.8.

Read that again. The typo-laden prompt outscored the formal one.

This is the moment in the story where the conventional wisdom starts to wobble. But it gets stranger.

The specificity principle

Formal vs Casual Delta by Dimension

Percentage point difference (formal − casual)

When I broke the scores down by dimension, a pattern emerged. Formal prompts didn't produce better quality. They didn't produce more creativity. What they produced was more thoroughness — and that's not the same thing.

And here's the crucial insight: the controlled prompt — casual register, complete information — scored 74.9. Higher than formal. The model didn't care about the "please" and "You are tasked with." It cared about "do everything, don't skip anything."

I started calling this the specificity principle. The thing that moves the needle isn't how formally you frame a request. It's how completely you describe what you want. You can say it in a t-shirt and flip-flops. You can misspell half of it. As long as the information is there, the model gets it.

The deliberation trap

But if formality doesn't improve quality, what does it do? It turns out it does something very concrete. It makes models think more.

Total Token Usage by Tone

Formal prompts consumed 289,000 tokens on average. Casual consumed 150,000. That's not a subtle difference. That's almost double. And nearly all the extra tokens went to reasoning — the model re-reading its own work, deliberating, taking smaller and more careful steps.

Process Complexity by Tone

You can see it in the process data. Formal prompts produced 43% more steps, but fewer tool calls per step. The model stopped batching actions together and started working one careful piece at a time. It became, for lack of a better word, conscientious.

This is what I call the deliberation trap. Formal language triggers a mode of processing that is more thorough, more careful, and vastly more expensive — without producing meaningfully better results on average. It's like putting on a suit to do the same job you do in jeans. You might feel more professional, but the spreadsheet doesn't know what you're wearing.

The exception that proves the rule

But here's where it gets genuinely interesting. Because there is a case where formal prompts make a real difference. You just have to know where to look.

Quality Change: Formal vs Casual (by model)

Percentage point difference in quality score

Claude Opus — the largest, most capable model in the test — showed a 5-point quality gain from formal prompts. GPT-5.2 Codex showed 2.2 points. But the smaller models, Haiku and Mini, showed essentially nothing. Zero. The correlation between model size and tone sensitivity was 0.98. Near perfect.
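That correlation is just Pearson's r over four points. Here is a toy recomputation: the 5.0 and 2.2 deltas come from the results above, but the exact near-zero deltas for Haiku and Mini aren't stated, so the 0.0 and 0.1 below are stand-ins, and the size ranks (1 = smallest) are an assumed ordering rather than published parameter counts.

```python
from math import sqrt

# Size ranks are an assumption (mini, haiku, codex, opus, smallest first).
# Deltas are formal-minus-casual quality, in points, from the results.
size_rank = [1, 2, 3, 4]
tone_delta = [0.0, 0.1, 2.2, 5.0]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(round(pearson(size_rank, tone_delta), 2))
```

With these stand-in numbers the script prints 0.94, in the same ballpark as the reported 0.98; the exact value depends on the true small-model deltas and on whether ranks or parameter counts are used.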

Think about what that means. Small models are already working as hard as they can. They're at their ceiling. Asking them politely doesn't give them new capabilities. But large models are different. Large models have slack. They have a performance envelope they don't always fill, and tone is one of the things that nudges them toward the upper edge.


Marcus and Sarah, revisited

So who was right? Marcus, with his lowercase and his "don't just yap about it"? Or Sarah, with her structured formality and her "I expect thorough, complete work"?

The answer, it turns out, is neither. And both.

On average, Marcus got the same quality work at half the price. His casual prompts were 71% more token-efficient. For the majority of tasks — the everyday, the routine, the good-enough — his approach was strictly superior.

The real winner, though, was neither of them. It was their colleague Jamie, who typed casually but thoroughly. "do everything, don't skip anything. i want to see every single file written." No forced formality, just complete information about what they wanted.

Jamie's prompts scored highest of all.

There's something almost poetic about this. In most systems — corporate, academic, social — surface-level polish pays outsized returns. The confident delivery beats the better idea. The quick car wash adds hundreds to the used car. People learn, reasonably, that a little shine goes a long way.

But the language model charges you for the shine and gives you back roughly the same car. Say what you mean, completely, in whatever voice is yours. That's enough.


249 blind-scored evaluations. Full methodology, prompts, and data on GitHub.