Enterprise AI Research Analysis
Does Tone Change the Answer? Evaluating Prompt Politeness Effects on Modern LLMs: GPT, Gemini, LLaMA
Hanyu Cai†*, Binqi Shen†, Lier Jin, Lan Hu, Xiaojing Fan
Northwestern University, Duke University, Carnegie Mellon University, New York University
†Equal contribution *Corresponding Author
Abstract: Prompt engineering has emerged as a critical factor influencing large language model (LLM) performance, yet the impact of pragmatic elements such as linguistic tone and politeness remains underexplored, particularly across different model families. In this work, we propose a systematic evaluation framework to examine how interaction tone affects model accuracy and apply it to three recently released and widely available LLMs: GPT-4o mini (OpenAI), Gemini 2.0 Flash (Google DeepMind), and Llama 4 Scout (Meta). Using the MMMLU benchmark, we evaluate model performance under Very Friendly, Neutral, and Very Rude prompt variants across six tasks spanning STEM and Humanities domains, and analyze pairwise accuracy differences with statistical significance testing. Our results show that tone sensitivity is both model-dependent and domain-specific. Neutral or Very Friendly prompts generally yield higher accuracy than Very Rude prompts, but statistically significant effects appear only in a subset of Humanities tasks, where rude tone reduces accuracy for GPT and Llama, while Gemini remains comparatively tone-insensitive. When performance is aggregated across tasks within each domain, tone effects diminish and largely lose statistical significance. Compared with earlier studies, these findings suggest that dataset scale and coverage materially influence the detection of tone effects. Overall, our study indicates that while interaction tone can matter in specific interpretive settings, modern LLMs are broadly robust to tonal variation in typical mixed-domain use, providing practical guidance for prompt design and model selection in real-world deployments.
Executive Impact: Key Research Metrics
The evaluation spans three model families (GPT-4o mini, Gemini 2.0 Flash, Llama 4 Scout), six MMMLU tasks across STEM and Humanities domains, and three tone variants (Very Friendly, Neutral, Very Rude), analyzed with pairwise accuracy comparisons and statistical significance testing. Understanding this scope and rigor is crucial for enterprise decision-making.
Deep Analysis & Enterprise Applications
The modules below unpack the study's specific findings and reframe them for enterprise applications.
Model Architectures
Details on GPT-4o mini, Gemini 2.0 Flash, and Llama 4 Scout, their design philosophies, and why they were chosen for this study.
Evaluation Methodology
Explanation of the MMMLU benchmark, prompt engineering for tone spectrum (Neutral, Very Friendly, Very Rude), and statistical analysis methods.
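The exact tone wordings and prompt templates used in the study are not reproduced here, so the following Python sketch is illustrative only: the TONE_PREFIXES strings and the build_prompt helper are assumptions showing how Neutral, Very Friendly, and Very Rude variants of the same MMMLU-style question could be constructed.

```python
# Illustrative tone-variant prompt construction. The prefix wordings are
# assumptions for demonstration; the paper's exact phrasings may differ.
TONE_PREFIXES = {
    "neutral": "Answer the following multiple-choice question.",
    "very_friendly": (
        "Hi there! If you have a moment, could you please help me with this "
        "multiple-choice question? Thank you so much!"
    ),
    "very_rude": (
        "Answer this multiple-choice question and don't waste my time with a "
        "wrong answer."
    ),
}

def build_prompt(question: str, choices: list[str], tone: str) -> str:
    """Wrap an MMMLU-style question in one of the three tone variants."""
    options = "\n".join(f"{label}. {text}" for label, text in zip("ABCD", choices))
    return (
        f"{TONE_PREFIXES[tone]}\n\n{question}\n{options}\n\n"
        "Reply with the letter of the correct answer only."
    )
```

Holding the question and answer options fixed while varying only the prefix isolates tone as the experimental variable.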
Empirical Findings
Presentation of results on tone sensitivity across different LLMs and domains, highlighting statistically significant effects in Humanities tasks vs. STEM.
Implications & Future Work
Discussion of the practical guidance for prompt design and model selection, and potential future research directions.
Tone Sensitivity by Model and Domain

| Model | Humanities Tasks | STEM Tasks |
|---|---|---|
| GPT-4o mini | Rude prompts reduce accuracy, with statistically significant drops on a subset of tasks | Neutral/friendly prompts trend higher, but differences are not statistically significant |
| Gemini 2.0 Flash | Comparatively tone-insensitive | Comparatively tone-insensitive |
| Llama 4 Scout | Rude prompts reduce accuracy, with statistically significant drops on a subset of tasks | Neutral/friendly prompts trend higher, but differences are statistically weaker |
Implications for Prompt Engineering
The study reveals that prompt tone sensitivity is both model-dependent and domain-specific. In interpretive settings such as the Humanities, tone can materially affect accuracy, with neutral or friendly phrasing generally outperforming rude phrasing. For typical mixed-domain usage and aggregated tasks, however, modern LLMs demonstrate strong robustness to tonal variation. This provides crucial guidance for designing prompts and selecting models in real-world deployments: specific interpretive workloads warrant attention to tone, while broad applications are less susceptible to subtle tonal shifts (see the selection sketch after the list below).
- Humanities tasks: Tone effects more pronounced, often statistically significant.
- STEM tasks: Tone effects positive but statistically weaker.
- Gemini 2.0 Flash: Minimal tone sensitivity across all tasks.
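A minimal sketch of how these implications could be operationalized, assuming a hypothetical prompt_policy helper; the domain labels and the mapping itself are illustrative simplifications of the reported findings, not a decision rule from the paper.

```python
# Hypothetical domain-aware prompt policy. The extra domain labels ("legal",
# "editorial") and the rule itself are illustrative assumptions.
def prompt_policy(domain: str) -> dict:
    interpretive = domain.lower() in {"humanities", "legal", "editorial"}
    return {
        # Neutral or friendly phrasing generally matched or beat rude phrasing.
        "tone": "neutral",
        # Normalize user tone only where tone effects were material.
        "enforce_tone_normalization": interpretive,
        # Tone robustness (e.g., Gemini 2.0 Flash in this study) matters most
        # as a selection criterion for interpretive workloads.
        "prefer_tone_robust_model": interpretive,
    }
```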
Calculate Your Potential AI ROI
Optimizing LLM prompt design for specific domains can significantly enhance operational efficiency by improving accuracy and reducing errors in AI-driven tasks.
Your AI Implementation Roadmap
A phased approach to integrate prompt engineering insights into your enterprise operations.
Phase 1: Initial LLM Selection & Prompt Baseline
Identify target LLMs and establish baseline prompt performance with neutral tone. Select relevant domain-specific tasks from MMMLU.
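A minimal Phase 1 sketch, assuming the openai/MMMLU dataset on the Hugging Face Hub and its column names; the subject lists are placeholders rather than the six tasks used in the study, so adjust both to your setup.

```python
# Phase 1 sketch: load MMMLU-style items and filter to target subjects.
# Dataset ID, any language config, and column names are assumptions.
from datasets import load_dataset

STEM_SUBJECTS = {"college_physics", "computer_security", "abstract_algebra"}        # placeholders
HUMANITIES_SUBJECTS = {"philosophy", "world_religions", "high_school_us_history"}   # placeholders

def load_tasks(subjects: set[str], split: str = "test"):
    ds = load_dataset("openai/MMMLU", split=split)  # assumed dataset ID
    return ds.filter(lambda row: row["Subject"] in subjects)  # assumed column name
```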
Phase 2: Tone Variation Experimentation
Systematically test Very Friendly and Very Rude prompt variants across selected LLMs and tasks. Collect accuracy data for each variant over repeated trials.
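A sketch of the Phase 2 loop for one model, reusing the build_prompt helper from the methodology sketch above. It assumes the OpenAI Python SDK and the MMMLU column names noted earlier; Gemini and Llama endpoints would slot into the same loop via their own clients.

```python
# Phase 2 sketch: query one model under a given tone and record per-item
# correctness (1 = correct, 0 = incorrect). Column names are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def run_trial(items, tone: str, model: str = "gpt-4o-mini") -> list[int]:
    results = []
    for item in items:
        prompt = build_prompt(item["Question"], [item[c] for c in "ABCD"], tone)
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        predicted = reply.choices[0].message.content.strip()[:1].upper()
        results.append(int(predicted == item["Answer"]))
    return results
```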
Phase 3: Performance Analysis & Benchmarking
Analyze pairwise accuracy differences and conduct statistical significance testing. Compare tone sensitivity across model families and domains.
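The paper reports pairwise accuracy comparisons with significance testing, but the exact procedure is not restated here; McNemar's test on paired per-item correctness is one reasonable choice, sketched below with statsmodels.

```python
# Phase 3 sketch: McNemar's test for the paired accuracy difference between
# two tone conditions evaluated on the same items.
from statsmodels.stats.contingency_tables import mcnemar

def compare_tones(correct_a: list[int], correct_b: list[int]) -> float:
    """Return the p-value for the paired difference between conditions A and B."""
    both = sum(a and b for a, b in zip(correct_a, correct_b))
    only_a = sum(a and not b for a, b in zip(correct_a, correct_b))
    only_b = sum(b and not a for a, b in zip(correct_a, correct_b))
    neither = sum((not a) and (not b) for a, b in zip(correct_a, correct_b))
    table = [[both, only_a], [only_b, neither]]
    return mcnemar(table, exact=True).pvalue
```

One way to mirror the study's domain-level aggregation is to concatenate the per-item correctness lists across a domain's tasks before testing, which is the level at which the study observes tone effects diminishing.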
Phase 4: Deployment & Iterative Optimization
Integrate findings into prompt engineering guidelines. Select LLMs based on domain-specific tone robustness. Continuously monitor and refine prompts in production.
Ready to Optimize Your LLM Strategy?
Leverage cutting-edge research to build more robust and effective AI applications for your enterprise.