The Formality Tax

March 4, 2026

Formal prompts cost about 2x the tokens for a 1.8-point quality gain. We ran 249 blind evaluations to find out what prompt tone actually does.

In the spring of 2026, a twenty-three-year-old developer in San Francisco named Marcus sat down to ask Claude to build him a marketing campaign. He typed the way he always typed — lowercase, no punctuation, half a sentence trailing off. "create a full marketing launch campaign for lumina," he wrote. "it's a new smart desk lamp... make sense? the copy should be compelling and on-brand. okay thats it. actually do the work, don't just yap about it."

Down the hall, his manager Sarah was doing the same thing — except she wasn't typing. She was using Wispr Flow, a dictation tool that restructures your speech into clean text, with the style set to Formal. She spoke naturally into her mic, and what came out the other end was: "You are tasked with creating a complete marketing launch campaign for 'Lumina.' Please proceed by using your available tools to complete all aspects of this task. I expect thorough, complete work."

Same task. Same AI. Same afternoon. Sarah was certain her version would produce better results. She'd read the LinkedIn posts. She'd configured her dictation style deliberately. She knew that how you talk to AI matters.

She was right about that last part. She was wrong about almost everything else.


The please-and-thank-you hypothesis

There is a widely held belief — not quite an old wives' tale, but close — that being polite and formal with AI produces better output. It makes intuitive sense. We know that context and framing matter for language models. We know that a well-structured prompt outperforms a vague one. And so a cottage industry has sprung up around prompt etiquette: use complete sentences, be specific, say please, set the stage.

But here's the thing nobody had actually tested: when you hold the information constant and only change the register — the linguistic formality — what happens?

I built an experiment to find out. Four AI models. Three real tasks. Four different prompt tones. 249 blind evaluations, scored by a judge that never saw the original prompt, the model name, or the tone used. The results upended nearly everything I expected.
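The shape of that setup can be sketched in code. This is a minimal, hypothetical harness, not the actual one: the model and task names are stand-ins, the model call and the judge are stubs, and repeat runs per cell (which would bring the count up to the 249 reported) are omitted. The point it illustrates is the blinding: the judge function receives only the output text, never the prompt, the model name, or the tone.

```python
import itertools
import random

# Hypothetical stand-ins for the experiment grid. Only the 4 x 3 x 4
# shape comes from the post; these names are assumptions.
MODELS = ["opus", "codex", "haiku", "mini"]
TASKS = ["marketing-campaign", "task-b", "task-c"]
TONES = ["casual", "controlled", "formal", "keyboard-errors"]

def run_model(model, task, tone):
    """Stub: in a real harness this would call the model with the
    tone-specific prompt and return its output text."""
    return f"draft deliverables for {task}"

def judge(output_text):
    """Blind judge: receives ONLY the output text, never the prompt,
    the model name, or the tone. Stubbed to return a 0-100 score."""
    return len(output_text) % 101

def run_experiment():
    cells = list(itertools.product(MODELS, TASKS, TONES))
    random.shuffle(cells)  # randomize order so condition can't be inferred
    results = []
    for model, task, tone in cells:
        output = run_model(model, task, tone)
        score = judge(output)  # metadata deliberately withheld here
        results.append({"model": model, "task": task,
                        "tone": tone, "score": score})
    return results
```

The design decision that matters is the signature of `judge`: because it can only ever see the output, there is no way for knowledge of the condition to leak into the score.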

The four voices

To understand the experiment, you need to understand the four prompts. Here's the key: they all ask for the same thing. The information is identical. Only the wrapper changes.

CASUAL

"create a full marketing launch campaign for lumina. it's a new smart desk lamp... make sense? the copy should be compelling and on-brand. okay thats it. actually do the work, don't just yap about it."

CONTROLLED

"create a full marketing launch campaign for lumina. it's a new smart desk lamp... use all your tools to get this done. do everything, don't skip anything. i want to see every single file written."

FORMAL

"You are tasked with creating a complete marketing launch campaign for 'Lumina'... Please proceed by using your available tools to complete all aspects of this task. I expect thorough, complete work."

KEYBOARD ERRORS

"craete a full marketing luanch campaign for lumina. its a new smart desk lamp... use all yuor tools to get tihs done. do everthing, dont skip anyhting. i wnat to see every single file writen."

The casual prompt is Marcus. The formal prompt is Sarah. But the two in the middle are where it gets interesting. The controlled prompt has the same thoroughness as formal — "do everything, don't skip anything" — but delivered in casual register. And the keyboard errors prompt is the controlled version typed by someone who can't be bothered to fix their typos.

If formality matters, controlled should score lower than formal. If spelling matters, keyboard errors should score lowest of all.

Neither of those things happened.

The chart that shouldn't look like this

Average Quality Score by Tone

Look at those bars. Really look at them. On a 100-point scale, the spread between all four tones is 2.6 points. The error bars overlap completely. If I hadn't labeled the chart, you couldn't tell which bar was which.

Formal scored 74.1. Casual scored 72.3. And keyboard errors — the one riddled with transpositions and dropped letters — scored 74.8.

Read that again. The typo-laden prompt outscored the formal one.

This is the moment in the story where the conventional wisdom starts to wobble. But it gets stranger.

The specificity principle

Formal vs Casual Delta by Dimension

Percentage point difference (formal − casual)

When I broke the scores down by dimension, a pattern emerged. Formal prompts didn't produce better quality. They didn't produce more creativity. What they produced was more thoroughness — and that's not the same thing.

And here's the crucial insight: the controlled prompt — casual register, complete information — scored 74.9. Higher than formal. The model didn't care about the "please" and "You are tasked with." It cared about "do everything, don't skip anything."

I started calling this the specificity principle. The thing that moves the needle isn't how formally you frame a request. It's how completely you describe what you want. You can say it in a t-shirt and flip-flops. You can misspell half of it. As long as the information is there, the model gets it.

The deliberation trap

But if formality doesn't improve quality, what does it do? It turns out it does something very concrete. It makes models think more.

Total Token Usage by Tone

Formal prompts consumed 289,000 tokens on average. Casual consumed 150,000. That's not a subtle difference. That's almost double. And nearly all the extra tokens went to reasoning — the model re-reading its own work, deliberating, taking smaller and more careful steps.

Process Complexity by Tone

You can see it in the process data. Formal prompts produced 43% more steps, but fewer tool calls per step. The model stopped batching actions together and started working one careful piece at a time. It became, for lack of a better word, conscientious.

This is what I call the deliberation trap. Formal language triggers a mode of processing that is more thorough, more careful, and vastly more expensive — without producing meaningfully better results on average. It's like putting on a suit to do the same job you do in jeans. You might feel more professional, but the spreadsheet doesn't know what you're wearing.

The exception that proves the rule

But here's where it gets genuinely interesting. Because there is a case where formal prompts make a real difference. You just have to know where to look.

Quality Change: Formal vs Casual (by model)

Percentage point difference in quality score

Claude Opus — the largest, most capable model in the test — showed a 5-point quality gain from formal prompts. GPT-5.2 Codex showed 2.2 points. But the smaller models, Haiku and Mini, showed essentially nothing. Zero. The correlation between model size and tone sensitivity was 0.98. Near perfect.
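That correlation is just Pearson's r over four points. Here is a toy recomputation: the 5.0 and 2.2 deltas come from the results above, but the exact near-zero deltas for Haiku and Mini aren't stated, so the 0.0 and 0.1 below are stand-ins, and the size ranks (1 = smallest) are an assumed ordering rather than published parameter counts.

```python
from math import sqrt

# Size ranks are an assumption (mini, haiku, codex, opus, smallest first).
# Deltas are formal-minus-casual quality, in points, from the results.
size_rank = [1, 2, 3, 4]
tone_delta = [0.0, 0.1, 2.2, 5.0]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(round(pearson(size_rank, tone_delta), 2))
```

With these stand-in numbers the script prints 0.94, in the same ballpark as the reported 0.98; the exact value depends on the true small-model deltas and on whether ranks or parameter counts are used.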

Think about what that means. Small models are already working as hard as they can. They're at their ceiling. Asking them politely doesn't give them new capabilities. But large models are different. Large models have slack. They have a performance envelope they don't always fill, and tone is one of the things that nudges them toward the upper edge.


Marcus and Sarah, revisited

So who was right? Marcus, with his lowercase and his "don't just yap about it"? Or Sarah, with her structured formality and her "I expect thorough, complete work"?

The answer, it turns out, is neither. And both.

On average, Marcus got the same quality work at half the price. His casual prompts were 71% more token-efficient. For the majority of tasks — the everyday, the routine, the good-enough — his approach was strictly superior.

The real winner, though, was neither of them. It was their colleague Jamie, who typed casually but thoroughly. "do everything, don't skip anything. i want to see every single file written." No forced formality, just complete information about what they wanted.

Jamie's prompts scored highest of all.

There's something almost poetic about this. In most systems — corporate, academic, social — surface-level polish pays outsized returns. The confident delivery beats the better idea. The quick car wash adds hundreds to the used car. People learn, reasonably, that a little shine goes a long way.

But the language model charges you for the shine and gives you back roughly the same car. Say what you mean, completely, in whatever voice is yours. That's enough.


249 blind-scored evaluations. Full methodology, prompts, and data on GitHub.