
Large Language Models for Propaganda Detection

October 10, 2023

Kilian Sprenkamp

University of Zurich

kilian.sprenkamp@uzh.ch

Daniel Gordon Jones

University of Zurich

danielgordon.jones@uzh.ch

Liudmila Zavolokina

University of Zurich

zavolokina@ifi.uzh.ch

Abstract

The prevalence of propaganda in our digital society poses a challenge to societal harmony and the dissemination of truth. Detecting propaganda through NLP in text is challenging due to subtle manipulation techniques and contextual dependencies. To address this issue, we investigate the effectiveness of modern Large Language Models (LLMs) such as GPT-3 and GPT-4 for propaganda detection. We conduct experiments using the SemEval-2020 task 11 dataset, which features news articles labeled with 14 propaganda techniques as a multi-label classification problem. Five variations of GPT-3 and GPT-4 are employed, incorporating various prompt engineering and fine-tuning strategies across the different models. We evaluate the models’ performance by assessing metrics such as F1 score, Precision, and Recall, comparing the results with the current state-of-the-art approach using RoBERTa. Our findings demonstrate that GPT-4 achieves comparable results to the current state-of-the-art. Further, this study analyzes the potential and challenges of LLMs in complex tasks like propaganda detection.

1        Introduction

Propaganda has long been a pernicious tactic of authoritarian regimes, used to manipulate public opinion, legitimize political power, and stifle dissent (Stanley, 2015). Propaganda can create a distorted and often misleading picture of reality that reinforces the authority of the ruling elite and undermines the ability of the population to challenge it. This in turn leads to increased polarization and division, reduced critical thinking skills, and the dehumanization of others. As such, the detection and exposure of propaganda has emerged as a critical task in the digital information ecosystem (e.g. Da San Martino et al. (2020a)).

Recently, Large Language Models (LLMs) such as GPT (Brown et al., 2020; OpenAI, 2023a) have shown remarkable capabilities in various NLP tasks. Due to self-supervised pre-training on large amounts of text, LLMs achieve state-of-the-art results in zero- and few-shot scenarios (Brown et al., 2020). However, the application of LLMs to non-standard tasks, such as propaganda detection framed as a multi-label classification problem, remains underexplored. Furthermore, the unique potential and challenges of LLMs in such complex tasks have not been thoroughly investigated. Our study aims to fill this knowledge gap by conducting structured experiments that apply prompt-based learning (Liu et al., 2023a) with multiple versions of GPT-3 and GPT-4.

2        Background

2.1        Propaganda Detection Approaches

Propaganda is the intentional influencing of someone’s opinion using various rhetorical and psychological techniques (Da San Martino et al., 2020b). Propaganda uses techniques such as loaded language (using words or phrases with strong emotional connotations to influence an audience’s opinion) or flag waving (associating oneself or one’s cause with patriotism or a national symbol to gain support) to manipulate perceptions and shape public attitudes (Da San Martino et al., 2019). Investigative initiatives such as PolitiFact or Bellingcat counter propaganda through manual fact-checking and play a crucial role in exposing propaganda techniques to the public. However, these initiatives often rely on labor-intensive analysis and volunteer research, which is not scalable.

The ongoing discourse on automated propaganda detection revolves around framing the task as either a supervised token or multi-label classification problem (Martino et al., 2020; Alam et al., 2022; Al-Omari et al., 2019; Da San Martino et al., 2019). The available labeled datasets focus predominantly on the detection of propaganda within the English language (Al-Omari et al., 2019; Martino et al., 2020). Obtaining labeled data for this purpose is challenging, time-consuming, costly, and largely subjective (Ahmed et al., 2021). The SemEval-2020 task 11: Detection of propaganda techniques in news articles (Martino et al., 2020) currently stands as the benchmark supervised dataset for propaganda detection. It defines propaganda through 14 techniques outlined in Table 2 within the Appendix, and includes articles labeled with these techniques. For the task of multi-label propaganda detection, various models have been developed, including shallow learning models (e.g. Random Forest or SVM) (Ermurachi and Gifu, 2020), CNN and LSTM models (Abedalla et al., 2019), as well as BERT-based models (Abdullah et al., 2022; Da San Martino et al., 2020a). For the SemEval-2020 task 11 dataset, the current state-of-the-art model was created by Abdullah et al. (2022), who fine-tuned RoBERTa (Liu et al., 2019) with inputs truncated to 120 tokens using the RoBERTa tokenizer for the given task. However, to the best of our knowledge, the usage of LLMs for propaganda detection has not yet been tested.

2.2        Large Language Models for Text Classification

The advent of LLMs has brought about significant advancements in the field of NLP. LLMs leverage the transformer architecture (Vaswani et al., 2017) and are trained on vast amounts of text (e.g. Gao et al. (2020)) using self-supervision (Yarowsky, 1995; Liu et al., 2023b). Having learned rich linguistic features, LLMs can be used for various NLP tasks within zero- and few-shot learning settings (Radford et al., 2019; Brown et al., 2020), reducing the need for training data to achieve state-of-the-art results.

The increasing proficiency of LLMs has sparked a rising interest in the area of prompt-based learning (Liu et al., 2023a). Unlike the standard supervised learning approach, which models P(y|x; Θ), prompt-based learning with LLMs instead models P(x; Θ) and uses this probability to predict y (Liu et al., 2023a). While transfer learning with LLMs within a standard supervised learning framework can yield impressive results, this is often achieved by adapting the last layer of the neural network to the task (e.g. Kant et al. (2018); Raffel et al. (2020)). Our focus, however, is on exploring the potential of prompt-based learning approaches.

Prompt engineering, a key component of prompt-based learning, is a systematic process of creating prompts based on a variety of established patterns to boost the performance of a model (Liu and Chilton, 2022). The two most prevalent strategies for prompting LLMs are the ‘few-shot’ and ‘chain of thought’ patterns (Wei et al., 2022). In the ‘few-shot’ pattern, the prompt features a series of examples of x and y. These examples serve as a roadmap, helping to steer the LLM toward the intended task and presenting an ideal template for input and output formatting (Wei et al., 2022). The ‘chain of thought’ pattern, in contrast, adopts a more prescriptive strategy, using a series of clear instructions to guide the LLM through the process. This method stimulates the LLM to reflect upon its output, often producing more refined and accurate results (Wei et al., 2022). Moreover, the ‘chain of thought’ pattern can be used as a tool for explainability, enabling the model to generate an explanation of the given output (Liang et al., 2021). Prompt-based approaches for text classification have been applied recently within various fields such as medicine (Sivarajkumar and Wang, 2022), law (Trautmann et al., 2022), or human resource management (Clavié et al., 2023). Utilizing LLMs via prompt engineering for text classification has the benefit of being highly adaptive towards a given task without the need for transfer learning (Puri and Catanzaro, 2019). However, as the primary functions of LLMs are the analysis and generation of text (Brown et al., 2020), there are numerous factors that determine model performance. LLM outputs are not inherently deterministic or standardized; for text classification, this means the output may not fall within the predefined set of categories or may not be reproducible across multiple inference calls. Furthermore, LLMs are prone to output repetitions, hallucinations, and unanticipated terminations in output generation due to vanishing gradients (Lee, 2023) or the underlying training data (Ji et al., 2023). A possible solution is offered by reinforcement learning from human feedback (RLHF) (Christiano et al., 2017), which enables better alignment with the user’s intent (OpenAI, 2023a).
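To make the two patterns concrete, the sketch below contrasts a minimal ‘few-shot’ prompt with a ‘chain of thought’ prompt for a simplified propaganda-labeling task; the wording is purely illustrative and not the prompts used in this study (only the technique names come from Table 2).

```python
# Illustrative contrast of the two prompting patterns (not the study's prompts).

FEW_SHOT_PROMPT = """Classify the propaganda techniques used in the text.
Possible labels: Flag-Waving, Doubt, Loaded_Language.

Text: "Entering this war will make us have a better future in our country."
Labels: Flag-Waving

Text: "Is he ready to be the Mayor?"
Labels: Doubt

Text: "{article}"
Labels:"""

CHAIN_OF_THOUGHT_PROMPT = """Classify the propaganda techniques used in the text.
Possible labels: Flag-Waving, Doubt, Loaded_Language.

Follow these steps:
1. Quote every passage that may contain a propaganda technique.
2. For each passage, explain which technique it matches and why.
3. Finish with a single line starting with "Labels:" listing all techniques found.

Text: "{article}"
"""
```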

3        Methodology

To analyze the potentials and challenges of LLMs for the task of multi-label text classification in propaganda detection, we use the SemEval-2020 task 11: Detection of propaganda techniques in news articles dataset (Martino et al., 2020). The dataset is split into 371 training articles (of which we kept 20% for validation purposes) and 75 articles for testing. We employ five model variations, based on fine-tuned GPT-3 (Brown et al., 2020) and out-of-the-box GPT-4 (OpenAI, 2023a), to detect propaganda techniques within the news articles.

For GPT-4, we employ two distinct types of prompts. In one approach, referred to as the ‘base’ prompt, the model is solely tasked with outputting the labels for a given article. In a complementary approach, we use a ‘chain of thought’ prompt that instructs the model to engage in a reasoning process about the predicted labels. For both types of prompts, we apply the ‘few-shot’ pattern, giving a single example for each propaganda technique within the prompt (Table 2); a sketch of this setup follows below.
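As a rough sketch of this setup, assuming the 2023-era openai Python client (openai<1.0): the EXAMPLES dictionary, prompt wording, and helper names below are our illustrative assumptions, not the exact code from the repository.

```python
# Sketch of the GPT-4 'base' prompt with one few-shot example per technique.
# Assumes the 2023-era openai client (openai<1.0); names are illustrative.
import openai

EXAMPLES = {  # one example per technique, drawn from Table 2 (abridged here)
    "Loaded_Language": "A lone lawmaker's childish shouting.",
    "Flag-Waving": "Entering this war will make us have a better future in our country.",
    # ... the remaining twelve techniques would be listed the same way ...
}

def build_base_prompt(article: str) -> str:
    shots = "\n\n".join(
        f'Text: "{example}"\nLabels: {label}' for label, example in EXAMPLES.items()
    )
    return (
        "Identify all propaganda techniques present in the article below. "
        "Answer with a comma-separated list of labels.\n\n"
        f'{shots}\n\nText: "{article}"\nLabels:'
    )

def classify_article(article: str) -> str:
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": build_base_prompt(article)}],
        temperature=0,  # keep the output as deterministic as possible
    )
    return response["choices"][0]["message"]["content"]
```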

Additionally, we fine-tuned GPT-3 Davinci on the given dataset and employ the ‘base’ and ‘chain of thought’ prompts at the inference stage. It is worth noting that the OpenAI API (OpenAI, 2023b) suggests that training only requires defining the input and output, without an instructional prompt. To investigate this proposition, we fine-tuned another model without any instruction given at training and inference. Furthermore, for the GPT-3 models, we split the input data into multiple chunks due to the maximum token limit of 2048, compared to 8192 for GPT-4. To ensure reproducible outputs, we set the temperature parameter to 0, such that the output becomes more deterministic. We benchmark our five models against the current state-of-the-art approach by Abdullah et al. (2022), which investigated the capabilities of several transformer-based models (Vaswani et al., 2017), with RoBERTa (Liu et al., 2019) scoring the highest F1 score. We assess the performance of all models using the micro-average F1 score, as in Abdullah et al. (2022); we additionally analyze the micro-average Precision and Recall to better understand the behavior of the employed models. Lastly, we compare F1 scores per label (i.e. the propaganda technique) for each model. All experiments can be replicated through our anonymized GitHub repository.
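For reference, the micro-averaged metrics described above can be computed over multi-label predictions as in the following sketch (scikit-learn, with hypothetical toy predictions); this mirrors the evaluation procedure, not the exact script in the repository.

```python
# Sketch of the micro-averaged evaluation over multi-label predictions.
# The toy y_true / y_pred lists are hypothetical.
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import precision_score, recall_score, f1_score

TECHNIQUES = ["Loaded_Language", "Doubt", "Flag-Waving"]  # in full: all 14 labels

y_true = [["Loaded_Language", "Doubt"], ["Flag-Waving"], ["Doubt"]]
y_pred = [["Loaded_Language"], ["Flag-Waving", "Doubt"], ["Doubt"]]

mlb = MultiLabelBinarizer(classes=TECHNIQUES)
Y_true = mlb.fit_transform(y_true)
Y_pred = mlb.transform(y_pred)

print("Precision:", precision_score(Y_true, Y_pred, average="micro", zero_division=0))
print("Recall:   ", recall_score(Y_true, Y_pred, average="micro", zero_division=0))
print("F1:       ", f1_score(Y_true, Y_pred, average="micro", zero_division=0))

# Per-label F1 scores (as in Table 3): f1_score(Y_true, Y_pred, average=None)
```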

4        Experiments

The results from our proposed experiments are depicted in Table 1. It is discernible that both versions of GPT-4 significantly exceed the performance of the fine-tuned GPT-3 models, with the GPT-4 ‘base’ model achieving the highest F1 score among the models we developed.

Model | Precision | Recall | F1 Score
GPT-4 base | 52.86% | 64.52% | 58.11%
GPT-4 chain of thought | 56.86% | 57.82% | 57.34%
GPT-3 base | 44.35% | 44.00% | 44.18%
GPT-3 chain of thought | 48.62% | 28.16% | 35.66%
GPT-3 no instruction | 47.54% | 32.48% | 38.59%
Baseline (Abdullah et al., 2022) | NA | NA | 63.40%

Table 1: Performance of different models. GPT-3 was fine-tuned, while GPT-4 was employed out-of-the-box

GPT-3. Further analysis revealed a tendency of the fine-tuned GPT-3 models to overfit to the training data, which led to a propensity to predict only those labels that were over-represented in the training set (for instance, Loaded_Language). Additional scrutiny of the GPT-3 models’ output uncovered issues such as repetitions, hallucinations, and unanticipated terminations in output generation, even when the maximum token limit had not been reached, indicating that the models can suffer from vanishing gradients. This issue is particularly conspicuous when comparing the output from GPT-3 and GPT-4 in the ‘chain of thought’ models. While GPT-4 generated plausible reasoning for each classified label, GPT-3 was unable to do so.
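One way to make such over-representation visible is to inspect the label frequencies in the training split, for example with a small helper like the one below; this is a hedged sketch, not code from our pipeline, and the training_labels input is assumed to come from whatever data-loading step precedes it.

```python
# Sketch: inspect label frequencies in the training split to gauge the class
# imbalance that the fine-tuned GPT-3 models appear to overfit to.
from collections import Counter

def label_distribution(training_labels):
    """training_labels: list of label lists, one per training article (assumed input)."""
    counts = Counter(label for labels in training_labels for label in labels)
    total = sum(counts.values())
    for label, count in counts.most_common():
        print(f"{label:40s} {count:5d}  ({count / total:.1%})")
```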

GPT-4. Comparing the two GPT-4 models, we noticed that the ‘chain of thought’ prompt led to a higher Precision, suggesting that the model is more discerning and tends to classify a label only when it can provide sound reasoning for it. In contrast, the GPT-4 ‘base’ model displayed a higher Recall, indicating a more liberal approach to prediction, resulting in a better capture of positive instances, albeit at the cost of a higher number of false positives. These disparities are further illuminated when comparing the F1 score for each label, as shown in Table 3 in the Appendix. For instance, the GPT-4 ‘chain of thought’ model scored an F1 score of 0% for five propaganda techniques, implying an inability to reason about these concepts.

Baseline Comparison. While the F1 scores of both GPT-4 models are similar, they do not surpass the current state-of-the-art method based on RoBERTa (Abdullah et al., 2022). Nevertheless, when we juxtaposed the F1 scores per label of the GPT-4 ‘base’ model with the method of Abdullah et al. (2022), our model demonstrated superior performance for seven out of 14 propaganda techniques.

5        Discussion

Based on the experiments, we identified potentials and challenges of LLMs for propaganda detection in multi-label classification problems.

Potentials. Our results show that prompt-based learning, especially with GPT-4, can yield results comparable to traditional supervised-learning methods. This is advantageous as it significantly reduces the explicit need for labeled training data and training resources. Further, by setting the temperature parameter to 0, we reduce the risk of non-deterministic results. With the ongoing development of the OpenAI API, reproducibility is likely to increase further, as the output can be predefined within functions (OpenAI, 2023c). Moreover, our ‘chain of thought’ approach elucidates the reasoning behind flagging a particular text as propaganda. As an extension of Abdullah et al. (2022), we enable the model to classify texts and generate its own interpretations. However, potential dangers arise from this reasoning, as it may result from hallucinations (Lee, 2023; Ji et al., 2023) caused by the LLM, posing a risk of over-reliance by the end user.

Moreover, GPT-4 being superior to GPT-3 in the understanding of linguistic features proves beneficial for the subjective task of propaganda detection, where general knowledge outweighs the need for fine-tuning. We wish to highlight RLHF’s crucial role in driving the performance difference between the GPT-3 and GPT-4 models, as it was only applied to GPT-4 (Brown et al., 2020; OpenAI, 2023a).

Regarding the usage of GPT-4, while the ‘base’ model performs slightly better on Recall and the F1 score, the ‘chain of thought’ model boasts higher Precision. This highlights that the model’s prediction tendency can be adjusted by refining the prompt instruction, depending on the importance of Precision or Recall in the specific task.

Challenges. Our research indicates that the GPT-3 models suffer from vanishing gradients, leading to repetitions, hallucinations, and abrupt stops within the output, making them unsuitable for propaganda detection (Lee, 2023; Ji et al., 2023). Additionally, with just 371 articles available for training, the data seems insufficient for learning the complex nuances of the different propaganda techniques, and the fine-tuned models tend to overfit.

Additionally, we see the maximum token length as a potentially limiting factor for successful propaganda detection within newspaper articles. Given that our prompt instruction in its ‘base’ form is roughly 740 tokens long, this leaves just 1308 tokens in the case of GPT-3 and 7452 in the case of GPT-4 for the article that we want to analyze and the completion returned by the model. However, this maximum token length is still much higher than that of Abdullah et al. (2022), who use a maximum token length of 140. For multi-label classification problems, the number of tokens used within the prompt instruction scales approximately linearly with the number of labels, as each label needs a description and possibly an example, depending on the underlying task; the sketch below illustrates the budget arithmetic.
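A minimal sketch of this token-budget arithmetic, assuming tiktoken’s cl100k_base encoding (the exact counts depend on the model-specific tokenizer and the prompt text):

```python
# Sketch of the token-budget arithmetic discussed above; exact counts depend
# on the model-specific tokenizer, so cl100k_base is used here as an assumption.
import tiktoken

CONTEXT_WINDOW = {"gpt-3": 2048, "gpt-4": 8192}
encoding = tiktoken.get_encoding("cl100k_base")

def remaining_budget(prompt_instruction: str, model: str) -> int:
    """Tokens left for the article text plus the model's completion."""
    return CONTEXT_WINDOW[model] - len(encoding.encode(prompt_instruction))

# With a 'base' instruction of roughly 740 tokens:
# 2048 - 740 = 1308 tokens remain for GPT-3, and 8192 - 740 = 7452 for GPT-4.
```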

Furthermore, the performance of GPT-4 varies significantly across different propaganda techniques compared to the baseline model (Abdullah et al., 2022) (Table 3). Some propaganda techniques (e.g., Bandwagon,Reductio_ad_hitlerum) are likely not well-represented in the model’s training data, which is not disclosed in OpenAI (2023a). To address this issue, three strategies can be considered. Firstly, fine-tuning GPT-4 with our propaganda dataset, which may become feasible in the near future. Secondly, adapting the given prompt, which currently relies solely on the definitions and examples of propaganda techniques by Da San Martino et al. (2019) and Abdullah et al. (2022). Thirdly, considering the subjective nature of propaganda classification, we could modify the classification schema to enhance the LLM’s comprehension. This proposition has far-reaching implications for task formulation in NLP.

6        Future Work

We presented a first analysis of how LLMs can be used for propaganda detection. Our contribution is two-fold. First, we show that LLMs, especially GPT-4, achieve comparable results to the current state-of-the-art in the task of propaganda detection. Second, we derive a preliminary analysis of the potentials and challenges of LLMs for multi-label classification problems. We aim to expand on this analysis in future publications, with the goal of creating an LLM-based tool that parallels the approach of Da San Martino et al. (2020a) while uniquely integrating LLMs, and that can be used by the general public to flag and explain propaganda in text.

Limitations

Our manuscript has several limitations.

First, we applied propaganda detection exclusively to news articles in the English language. We identified two major contributors that determine the applicability of the method to other languages: language morphology and the representation of each language within the training data of GPT-3 (Brown et al., 2020) and GPT-4 (OpenAI, 2023a).

Second, we did not use any open-source models like LLaMA (Touvron et al., 2023), which would enable more control over the training and inference process.

Third, we did not use the function call options within the OpenAI API (OpenAI, 2023c); we aim to do so in further model iterations.

Last, while our experiments were not limited by any computational budget, i.e. the cost of the OpenAI API (OpenAI, 2023b), we see this as a potential limiting factor for future large-scale research.

Ethics Statement

Regarding our manuscript, we see a generally positive broader impact, as it forms the basis for more effective detection of propaganda, which benefits society. However, we would like to state that limitations like misclassifications or hallucinations can impact the performance of LLMs for propaganda detection and have a negative impact on end users. Moreover, we fear that the application of automated propaganda detection can lead to over-reliance on the part of end users.

References

Malak Abdullah, Ola Altiti, and Rasha Obiedat. 2022. Detecting propaganda techniques in english news articles using pre-trained transformers. In 2022 13th International Conference on Information and Communication Systems (ICICS), pages 301–308. IEEE.

Ayat Abedalla, Aisha Al-Sadi, and Malak Abdullah. 2019. A closer look at fake news detection: A deep learning perspective. In Proceedings of the 2019 3rd international conference on advances in artificial intelligence, pages 24–28.

Alim Al Ayub Ahmed, Ayman Aljabouh, Praveen Kumar Donepudi, and Myung Suh Choi. 2021. Detecting fake news using machine learning: A systematic literature review. arXiv preprint arXiv:2102.04458.

Hani Al-Omari, Malak Abdullah, Ola Al-Titi, and Samira Shaikh. 2019. Justdeep at nlp4if 2019 shared task: Propaganda detection using ensemble deep learning models. EMNLP-IJCNLP 2019, page 113.

Firoj Alam, Hamdy Mubarak, Wajdi Zaghouani, Giovanni Da San Martino, and Preslav Nakov. 2022. Overview of the wanlp 2022 shared task on propaganda detection in arabic. arXiv preprint arXiv:2211.10057.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.

Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30.

Benjamin Clavié, Alexandru Ciceu, Frederick Naylor, Guillaume Soulié, and Thomas Brightwell. 2023. Large language models in the workplace: A case study on prompt engineering for job type classification. In International Conference on Applications of Natural Language to Information Systems, pages 3–17. Springer.

Giovanni Da San Martino, Shaden Shaar, Yifan Zhang, Seunghak Yu, Alberto Barrón-Cedeno, and Preslav Nakov. 2020a. Prta: A system to support the analysis of propaganda techniques in the news. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 287–293.

Giovanni Da San Martino, Shaden Shaar, Yifan Zhang, Seunghak Yu, Alberto Barrón-Cedeno, and Preslav Nakov. 2020b. Prta: A system to support the analysis of propaganda techniques in the news. pages 287–293.

Giovanni Da San Martino, Seunghak Yu, Alberto Barrón-Cedeno, Rostislav Petrov, and Preslav Nakov. 2019. Fine-grained analysis of propaganda in news article. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 5636–5646.

Vlad Ermurachi and Daniela Gifu. 2020. Uaic1860 at semeval-2020 task 11: detection of propaganda techniques in news articles. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, pages 1835–1840.

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. 2020. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027.

Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. 2023. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38.

Neel Kant, Raul Puri, Nikolai Yakovenko, and Bryan Catanzaro. 2018. Practical text classification with large pre-trained language models. arXiv preprint arXiv:1812.01207.

Minhyeok Lee. 2023. A mathematical investigation of hallucination and creativity in gpt models. Mathematics, 11(10):2320.

Zhengzhong Liang, Steven Bethard, and Mihai Surdeanu. 2021. Explainable multi-hop verbal reasoning through internal monologue. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1225–1250.

Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023a. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9):1–35.

Vivian Liu and Lydia B Chilton. 2022. Design guidelines for prompt engineering text-to-image generative models. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, pages 1–23.

Xiao Liu, Fanjin Zhang, Zhenyu Hou, Li Mian, Zhaoyu Wang, Jing Zhang, and Jie Tang. 2023b. Self-supervised learning: Generative or contrastive. IEEE Transactions on Knowledge and Data Engineering, 35(1):857–876.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.

G Martino, Alberto Barrón-Cedeno, Henning Wachsmuth, Rostislav Petrov, and Preslav Nakov. 2020. Semeval-2020 task 11: Detection of propaganda techniques in news articles. arXiv preprint arXiv:2009.02696.

OpenAI. 2023a. Gpt-4 technical report.

OpenAI. 2023b. Openai API. [Online; accessed 2023-06-15].

OpenAI. 2023c. Function calling and other API updates. [Online; accessed 2023-06-01].

Raul Puri and Bryan Catanzaro. 2019. Zero-shot text classification with generative language models. arXiv preprint arXiv:1912.10165.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551.

Sonish Sivarajkumar and Yanshan Wang. 2022. Healthprompt: A zero-shot learning paradigm for clinical natural language processing. arXiv preprint arXiv:2203.05061.

Jason Stanley. 2015. How propaganda works. In How propaganda works. Princeton University Press.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

Dietrich Trautmann, Alina Petrova, and Frank Schilder. 2022. Legal prompt engineering for multilingual legal judgement prediction. arXiv preprint arXiv:2212.02199.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, 30.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. 2022. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903.

David Yarowsky. 1995. Unsupervised word sense disambiguation rivaling supervised methods. In 33rd annual meeting of the association for computational linguistics, pages 189–196.

Appendix

Propaganda Technique | Definition | Example
Appeal_to_Authority | Supposes that a claim is true because a valid authority or expert on the issue supports it | “The World Health Organisation stated, the new medicine is the most effective treatment for the disease.”
Appeal_to_fear-prejudice | Builds support for an idea by instilling anxiety and/or panic in the audience towards an alternative | “Stop those refugees; they are terrorists.”
Bandwagon,Reductio_ad_hitlerum | Justify actions or ideas because everyone else is doing it, or reject them because it’s favored by groups despised by the target audience | “Would you vote for Clinton as president? 57% say yes.”
Black-and-White_Fallacy | Gives two alternative options as the only possibilities, when actually more options exist | “You must be a Republican or Democrat”
Causal_Oversimplification | Assumes a single reason for an issue when there are multiple causes | “If France had not declared war on Germany, World War II would have never happened.”
Doubt | Questioning the credibility of someone or something | “Is he ready to be the Mayor?”
Exaggeration,Minimisation | Either representing something in an excessive manner or making something seem less important than it actually is | “I was not fighting with her; we were just playing.”
Flag-Waving | Playing on strong national feeling (or with respect to a group, e.g., race, gender, political preference) to justify or promote an action or idea | “Entering this war will make us have a better future in our country.”
Loaded_Language | Uses specific phrases and words that carry strong emotional impact to affect the audience | “A lone lawmaker’s childish shouting.”
Name_Calling,Labeling | Gives a label to the object of the propaganda campaign that the audience either hates or loves | “Bush the Lesser.”
Repetition | Repeats the message over and over in the article so that the audience will accept it | “Our great leader is the epitome of wisdom. Their decisions are always wise and just.”
Slogans | A brief and striking phrase that contains labeling and stereotyping | “Make America great again!”
Thought-terminating_Cliches | Words or phrases that discourage critical thought and useful discussion about a given topic | “It is what it is”
Whataboutism,Straw_Men,Red_Herring | Attempts to discredit an opponent’s position by charging them with hypocrisy without directly disproving their argument | “They want to preserve the FBI’s reputation.”

Table 2: Propaganda Techniques after Da San Martino et al. (2019); Abdullah et al. (2022)

Propaganda Technique | GPT-4 base | GPT-4 chain of thought | GPT-3 base | GPT-3 chain of thought | GPT-3 no instruction | Baseline
Appeal_to_Authority | 23.52% | 19.05% | 11.11% | 16.67% | 0.00% | 47.36%
Appeal_to_fear-prejudice | 52.50% | 0.00% | 8.70% | 0.00% | 0.00% | 43.6%
Bandwagon,Reductio_ad_hitlerum | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 4.87%
Black-and-White_Fallacy | 54.54% | 0.00% | 12.5% | 0.00% | 10.00% | 24.09%
Causal_Oversimplification | 32.55% | 50.00% | 0.00% | 0.00% | 0.00% | 19.44%
Doubt | 52.63% | 54.54% | 16.98% | 19.42% | 16.22% | 61.36%
Exaggeration,Minimisation | 64.00% | 64.00% | 16.28% | 8.22% | 0.00% | 33.03%
Flag-Waving | 32.00% | 0.00% | 43.48% | 0.00% | 26.96% | 61.49%
Loaded_Language | 92.75% | 93.62% | 73.51% | 71.39% | 70.02% | 75.71%
Name_Calling,Labeling | 77.67% | 74.29% | 54.96% | 31.32% | 50.00% | 67.49%
Repetition | 56.67% | 65.79% | 31.88% | 16.00% | 17.78% | 31.14%
Slogans | 9.09% | 10.00% | 21.74% | 21.74% | 6.67% | 54.90%
Thought-terminating_Cliches | 20.00% | 0.00% | 0.00% | 0.00% | 0.00% | 25%
Whataboutism,Straw_Men,Red_Herring | 14.81% | 33.33% | 0.00% | 0.00% | 0.00% | 20.83%

Table 3: F1 Scores per Propaganda Technique. Note: Abdullah et al. (2022) only provide F1 scores for a test set whose labels are not publicly available; thus, only a limited comparison is possible.