Enterprise AI Research Analysis
Does Tone Change the Answer? Evaluating Prompt Politeness Effects on Modern LLMs: GPT, Gemini, LLaMA
Hanyu Cai†*, Binqi Shen†, Lier Jin, Lan Hu, Xiaojing Fan
Northwestern University, Duke University, Carnegie Mellon University, New York University
†Equal contribution *Corresponding Author
Abstract: Prompt engineering has emerged as a critical factor influencing large language model (LLM) performance, yet the impact of pragmatic elements such as linguistic tone and politeness remains underexplored, particularly across different model families. In this work, we propose a systematic evaluation framework to examine how interaction tone affects model accuracy and apply it to three recently released and widely available LLMs: GPT-4o mini (OpenAI), Gemini 2.0 Flash (Google DeepMind), and Llama 4 Scout (Meta). Using the MMMLU benchmark, we evaluate model performance under Very Friendly, Neutral, and Very Rude prompt variants across six tasks spanning STEM and Humanities domains, and analyze pairwise accuracy differences with statistical significance testing. Our results show that tone sensitivity is both model-dependent and domain-specific. Neutral or Very Friendly prompts generally yield higher accuracy than Very Rude prompts, but statistically significant effects appear only in a subset of Humanities tasks, where rude tone reduces accuracy for GPT and Llama, while Gemini remains comparatively tone-insensitive. When performance is aggregated across tasks within each domain, tone effects diminish and largely lose statistical significance. Compared with earlier studies, these findings suggest that dataset scale and coverage materially influence the detection of tone effects. Overall, our study indicates that while interaction tone can matter in specific interpretive settings, modern LLMs are broadly robust to tonal variation in typical mixed-domain use, providing practical guidance for prompt design and model selection in real-world deployments.
Executive Impact: Key Research Metrics
The evaluation spans three model families (GPT-4o mini, Gemini 2.0 Flash, Llama 4 Scout), six MMMLU tasks across STEM and Humanities domains, and three tone variants (Very Friendly, Neutral, Very Rude), analyzed with pairwise accuracy comparisons and statistical significance testing. Understanding this scope and rigor is crucial for enterprise decision-making.
Deep Analysis & Enterprise Applications
The modules below unpack the study's specific findings and reframe them for enterprise applications.
Model Architectures
Details on GPT-4o mini, Gemini 2.0 Flash, and Llama 4 Scout, their design philosophies, and why they were chosen for this study.
Evaluation Methodology
Explanation of the MMMLU benchmark, prompt engineering for tone spectrum (Neutral, Very Friendly, Very Rude), and statistical analysis methods.
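The exact tone wordings and prompt templates used in the study are not reproduced here, so the following Python sketch is illustrative only: the TONE_PREFIXES strings and the build_prompt helper are assumptions showing how Neutral, Very Friendly, and Very Rude variants of the same MMMLU-style question could be constructed.

```python
# Illustrative tone-variant prompt construction. The prefix wordings are
# assumptions for demonstration; the paper's exact phrasings may differ.
TONE_PREFIXES = {
    "neutral": "Answer the following multiple-choice question.",
    "very_friendly": (
        "Hi there! If you have a moment, could you please help me with this "
        "multiple-choice question? Thank you so much!"
    ),
    "very_rude": (
        "Answer this multiple-choice question and don't waste my time with a "
        "wrong answer."
    ),
}

def build_prompt(question: str, choices: list[str], tone: str) -> str:
    """Wrap an MMMLU-style question in one of the three tone variants."""
    options = "\n".join(f"{label}. {text}" for label, text in zip("ABCD", choices))
    return (
        f"{TONE_PREFIXES[tone]}\n\n{question}\n{options}\n\n"
        "Reply with the letter of the correct answer only."
    )
```

Holding the question and answer options fixed while varying only the prefix isolates tone as the experimental variable.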
Empirical Findings
Presentation of results on tone sensitivity across different LLMs and domains, highlighting statistically significant effects in Humanities tasks vs. STEM.
Implications & Future Work
Discussion of the practical guidance for prompt design and model selection, and potential future research directions.
Tone Sensitivity by Model and Domain

| Model | Humanities Tasks | STEM Tasks |
|---|---|---|
| GPT-4o mini | Rude prompts reduce accuracy, with statistically significant drops on a subset of tasks | Neutral/friendly prompts trend higher, but differences are not statistically significant |
| Gemini 2.0 Flash | Comparatively tone-insensitive | Comparatively tone-insensitive |
| Llama 4 Scout | Rude prompts reduce accuracy, with statistically significant drops on a subset of tasks | Neutral/friendly prompts trend higher, but differences are statistically weaker |
Implications for Prompt Engineering
The study reveals that prompt tone sensitivity is both model-dependent and domain-specific. In interpretive settings such as the Humanities, tone can materially affect accuracy, with neutral or friendly phrasing generally outperforming rude phrasing. For typical mixed-domain usage and aggregated tasks, however, modern LLMs demonstrate strong robustness to tonal variation. This provides crucial guidance for designing prompts and selecting models in real-world deployments: specific interpretive workloads warrant attention to tone, while broad applications are less susceptible to subtle tonal shifts (see the selection sketch after the list below).
- Humanities tasks: Tone effects more pronounced, often statistically significant.
- STEM tasks: Tone effects positive but statistically weaker.
- Gemini 2.0 Flash: Minimal tone sensitivity across all tasks.
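A minimal sketch of how these implications could be operationalized, assuming a hypothetical prompt_policy helper; the domain labels and the mapping itself are illustrative simplifications of the reported findings, not a decision rule from the paper.

```python
# Hypothetical domain-aware prompt policy. The extra domain labels ("legal",
# "editorial") and the rule itself are illustrative assumptions.
def prompt_policy(domain: str) -> dict:
    interpretive = domain.lower() in {"humanities", "legal", "editorial"}
    return {
        # Neutral or friendly phrasing generally matched or beat rude phrasing.
        "tone": "neutral",
        # Normalize user tone only where tone effects were material.
        "enforce_tone_normalization": interpretive,
        # Tone robustness (e.g., Gemini 2.0 Flash in this study) matters most
        # as a selection criterion for interpretive workloads.
        "prefer_tone_robust_model": interpretive,
    }
```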
Calculate Your Potential AI ROI
Optimizing LLM prompt design for specific domains can significantly enhance operational efficiency by improving accuracy and reducing errors in AI-driven tasks.
Your AI Implementation Roadmap
A phased approach to integrate prompt engineering insights into your enterprise operations.
Phase 1: Initial LLM Selection & Prompt Baseline
Identify target LLMs and establish baseline prompt performance with neutral tone. Select relevant domain-specific tasks from MMMLU.
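A minimal Phase 1 sketch, assuming the openai/MMMLU dataset on the Hugging Face Hub and its column names; the subject lists are placeholders rather than the six tasks used in the study, so adjust both to your setup.

```python
# Phase 1 sketch: load MMMLU-style items and filter to target subjects.
# Dataset ID, any language config, and column names are assumptions.
from datasets import load_dataset

STEM_SUBJECTS = {"college_physics", "computer_security", "abstract_algebra"}        # placeholders
HUMANITIES_SUBJECTS = {"philosophy", "world_religions", "high_school_us_history"}   # placeholders

def load_tasks(subjects: set[str], split: str = "test"):
    ds = load_dataset("openai/MMMLU", split=split)  # assumed dataset ID
    return ds.filter(lambda row: row["Subject"] in subjects)  # assumed column name
```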
Phase 2: Tone Variation Experimentation
Systematically test Very Friendly and Very Rude prompt variants across selected LLMs and tasks. Collect accuracy data for each variant over repeated trials.
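A sketch of the Phase 2 loop for one model, reusing the build_prompt helper from the methodology sketch above. It assumes the OpenAI Python SDK and the MMMLU column names noted earlier; Gemini and Llama endpoints would slot into the same loop via their own clients.

```python
# Phase 2 sketch: query one model under a given tone and record per-item
# correctness (1 = correct, 0 = incorrect). Column names are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def run_trial(items, tone: str, model: str = "gpt-4o-mini") -> list[int]:
    results = []
    for item in items:
        prompt = build_prompt(item["Question"], [item[c] for c in "ABCD"], tone)
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        predicted = reply.choices[0].message.content.strip()[:1].upper()
        results.append(int(predicted == item["Answer"]))
    return results
```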
Phase 3: Performance Analysis & Benchmarking
Analyze pairwise accuracy differences and conduct statistical significance testing. Compare tone sensitivity across model families and domains.
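The paper reports pairwise accuracy comparisons with significance testing, but the exact procedure is not restated here; McNemar's test on paired per-item correctness is one reasonable choice, sketched below with statsmodels.

```python
# Phase 3 sketch: McNemar's test for the paired accuracy difference between
# two tone conditions evaluated on the same items.
from statsmodels.stats.contingency_tables import mcnemar

def compare_tones(correct_a: list[int], correct_b: list[int]) -> float:
    """Return the p-value for the paired difference between conditions A and B."""
    both = sum(a and b for a, b in zip(correct_a, correct_b))
    only_a = sum(a and not b for a, b in zip(correct_a, correct_b))
    only_b = sum(b and not a for a, b in zip(correct_a, correct_b))
    neither = sum((not a) and (not b) for a, b in zip(correct_a, correct_b))
    table = [[both, only_a], [only_b, neither]]
    return mcnemar(table, exact=True).pvalue
```

One way to mirror the study's domain-level aggregation is to concatenate the per-item correctness lists across a domain's tasks before testing, which is the level at which the study observes tone effects diminishing.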
Phase 4: Deployment & Iterative Optimization
Integrate findings into prompt engineering guidelines. Select LLMs based on domain-specific tone robustness. Continuously monitor and refine prompts in production.
Ready to Optimize Your LLM Strategy?
Leverage cutting-edge research to build more robust and effective AI applications for your enterprise.