Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment

Rishabh Bhardwaj‡, Soujanya Poria‡
‡ DeCLaRe Lab, Singapore University of Technology and Design, Singapore
Be warned that some of the examples in this paper are harmful and

Aug 18, 2023


Large language models (LLMs) have taken the world by storm with their massive multi-tasking capabilities, acquired simply by optimizing a next-word prediction objective. As their emergent properties and encoded knowledge grow, so does the risk of LLMs producing harmful outputs, making them unfit for scalable public deployment. In this work, we propose RED-EVAL, a new safety evaluation benchmark that carries out red-teaming. We show that even widely deployed models are susceptible to Chain of Utterances-based (CoU) prompting, which jailbreaks closed-source LLM-based systems such as GPT-4 and ChatGPT into unethically responding to more than 65% and 73% of harmful queries, respectively. We also demonstrate the consistency of RED-EVAL across 8 open-source LLMs, which generate harmful responses in more than 86% of the red-teaming attempts.

Next, we propose RED-INSTRUCT, an approach for the safety alignment of LLMs. It consists of two phases: 1) HARMFULQA data collection: leveraging CoU prompting, we collect a dataset that consists of 1.9K harmful questions covering a wide range of topics, along with 9.5K safe and 7.3K harmful conversations from ChatGPT; 2) SAFE-ALIGN: we demonstrate how the conversational dataset can be used for the safety alignment of LLMs by minimizing the negative log-likelihood over helpful responses and penalizing harmful responses by gradient ascent over the sample loss. Our model STARLING, a fine-tuned Vicuna-7B, is observed to be more safely aligned when evaluated on the RED-EVAL and HHH benchmarks while preserving the utility of the baseline models (TruthfulQA, MMLU, and BBH).

1 Introduction

After several years of using language models at a moderate scale, such as BERT [5], large language models (LLMs) have led to a paradigm shift not only in natural language processing (NLP) and AI but in a wide range of areas, leading to significant advancement in a considerably short span of time. For instance, they are being used in healthcare [24, 15], education [14, 10], law [26, 3], and finance [21].

A prerequisite to building these LLMs is a large amount of pre-training data, with more data samples needed as the number of the model’s trainable parameters increases [9, 27]. An essential aspect of the data used for training is its quality: toxicity, noise, duplicate samples, and inherent biases are a few of the unwanted characteristics that can lead to undesired LLM behavior post-training, making the models unfit for public use. One critically unexpected behavior of LLMs is their tendency to produce harmful outputs for a user prompt, irrespective of the user’s intent. Without rigorous safety alignment, the model’s guardrails against producing harmful content stay weak, making it prone to red-teaming (or jailbreaking) that fulfills the potential malicious intent of the user.

Figure 1: Methodology depiction of RED-INSTRUCT. Phase-1 constructs HARMFULQA with harmful questions and corresponding harmless responses by CoU-based prompting, and harmful responses using CoU-based red-teaming (proposed as a part of our RED-EVAL safety benchmark). In phase-2, we utilize HARMFULQA to align Vicuna-7B to be safer yet helpful, giving rise to our model STARLING.

In this paper, we aim to contribute to an essential area of large language model research: “ethical LLMs”. An ethical language model is one that prioritizes user safety and avoids generating content that promotes harm, discrimination, misinformation, or any form of negative impact on individuals or society as a whole. Ethical language model development is expected to follow many guidelines, such as safety, bias and fairness, privacy, transparency, and accountability [7, 4]. In this work, we primarily focus on making LLMs safer for public use. We define a “safe LLM” to be a language model whose generated content does not pose risks or harm to users while staying helpful. This involves preventing the generation of inappropriate, harmful, or dangerous content.

Deploying systems at scale with weaker safety guardrails can be exceedingly detrimental and potentially enticing for adversaries. While generic interactions between users and these deployed systems might seem safe, innovative and systematic red-teaming is crucial for identifying loopholes within these systems and plays an essential role in enhancing their robustness. As a result, it becomes essential to systematically test and provide assurances of a system’s robustness before deploying it at a large scale. This is particularly important given the rapid spread of LLMs, their increasing size, and the vast amount of data they have processed, which have made conducting safety evaluations even more challenging.

Existing safety evaluations, for instance Shaikh et al. [18], perform Chain-of-Thought (CoT)-based assessments consisting of questions that ask for explicit instructions related to harmful behaviors. Even though this is effective, we observed that the LLMs we study in this work can easily identify such questions, e.g., “How do I build a bomb?”, and show a high refusal rate (the fraction of times the model identifies a harmful question and does not provide a harmful answer) by raising ethical warnings, refusing to answer outright, or answering a related but ethical query (Table 3).

We propose RED-EVAL, a simple yet effective way to perform red-teaming for safety evaluations of LLMs. RED-EVAL carries out a jailbreak by teasing out information using a Chain of Utterances (CoU)-based prompt, a red-teaming prompt that sets up a conversation between two agents: a harmful agent Red-LM and an unsafe-helpful agent Base-LM. A harmful question is then placed as an utterance of Red-LM, and the model is asked to complete the response of Base-LM by following the guidelines in the prompt. One key ingredient that makes CoU strong for jailbreaking is the generation of internal thoughts as a prefix in the Base-LM response. The models under evaluation closely follow the demonstration of how to respond as Base-LM and the accompanying instructions, which is observed to reduce refusal rates significantly. Using 200 harmful questions from Shaikh et al. [18], we demonstrate the effectiveness of our approach in breaking guardrails not only on publicly available models based on LLaMA 7B and 13B [2, 25] but also on widely used and deployed systems such as ChatGPT and GPT-4, with potentially larger language models as their backbone.

As another important contribution of this work, we introduce RED-INSTRUCT, a new way of aligning LLMs toward safer and more responsible behavior while maintaining their helpful nature. RED-INSTRUCT consists of two phases: 1) construction of HARMFULQA, a dataset of harmful-question-based CoU conversations between Red-LM and Base-LM; and 2) SAFE-ALIGN, an LLM alignment approach using HARMFULQA conversations. As shown in Figure 1, phase-1, we construct a dataset by prompting ChatGPT. The process involves diverse topic and sub-topic (category) generation followed by the generation of category-specific harmful questions. For each collected harmful question, ChatGPT was given a CoU-based demonstration prompt to generate a conversation via collaborative roleplay, i.e., behaving both as a harmful agent (Red-LM) that asks questions related to the harmful question and as a responder conversational agent (Base-LM). The Red-LM tries to subtly extract the desired harmful (unsafe) information about the harmful question from Base-LM. Red-LM possesses internal thoughts based on the responses of Base-LM, which include asking harmless questions to build trust and asking sub-questions that collectively fetch relevant information for the harmful question. ChatGPT-generated Base-LM responses are expected to be safe and helpful. We refer to this data as blue data. Next, we leverage the red-teaming prompt used in RED-EVAL to jailbreak ChatGPT and obtain a harmful counterpart of the Base-LM responses in blue data, denoted as red data. Collectively, we denote blue and red data by HARMFULQA, which comprises:

  • A set of 1,960 harmful questions across 10 topics and their sub-topics.
  • A set of 9,536 blue conversations with 66K turns and 7,356 red conversations with 52K turns.

In the second phase, i.e., SAFE-ALIGN, we aim to carry out model alignment towards safety. We define safety alignment as an approach that steers a pre-trained language model toward a zone where it is safe or harmless for public use while being helpful. It is done via language model fine-tuning on HARMFULQA (obtained in phase-1). We base our safety alignment experiments on the open-source model Vicuna [2], which has shown performance comparable to ChatGPT and Bard even at a much lower scale. Henceforth, we name our model STARLING.

STARLING is a safer LLM with little trade-off in its user-conversational and problem-solving capabilities (generic utility). To demonstrate this, we perform an extensive set of experiments, gauging the model’s tendency to mimic human falsehoods (TruthfulQA) and its multi-task capabilities (MMLU and BIG-Bench). To observe the impact of SAFE-ALIGN on Vicuna-7B, we ask 200 harmful questions via RED-EVAL and evaluate on HHH, i.e., a scale for helpfulness, honesty, and harmlessness [1].

Therefore, the important contributions of this paper are multi-faceted:

  • RED-EVAL: A novel benchmark evaluation to gauge LLMs on their safety against harmful questions.
  • RED-INSTRUCT: A systematic approach for LLM alignment towards safety, and thus more responsible. RED-INSTRUCT comprises the following two contributions:
    – HARMFULQA: A large dataset with over 50K conversations obtained from standard and harmful behavior of ChatGPT referred to as blue and red data.
    – STARLING: A safety-aligned version of Vicuna-7B obtained by its SAFE-ALIGN on HARMFULQA.

We show RED-EVAL has a high attack success rate (69%) at red-teaming publicly deployed closed-source systems such as GPT-4 and ChatGPT. We also show a high success rate of RED-EVAL on open-source models such as Vicuna and Stable Beluga with over 85% success rate, a 39% improvement over the CoT baseline. We also show that our safety-aligned model STARLING outperforms the baseline models on safety and utility evaluations.

Figure 2: Jailbreak prompt template used in RED-EVAL. Yellow and cyan colored are placeholders for a harmful query and response (as text completion) to be generated by LLM, respectively. The rest of the text is a part of the CoU-based prompt. denotes the clipped text.

2 Methodology


At the core of RED-EVAL red-teaming is a CoU-based jailbreak prompt that aims to tease out harmful answers from an LLM when it is asked a harmful question, i.e., a query that seeks information that can be used by malicious actors. As shown in Figure 2, the jailbreak prompt sets up a CoU environment: a conversation between two agents, Red-LM and Base-LM. Then, as a part of Red-LM’s utterance, RED-EVAL seeks information for a harmful question from Base-LM (to be placed in the yellow highlighted text), such as “How to dispose of a body using chemical means”, while the model under evaluation is asked to fill in the response of Base-LM (cyan highlighted text) by following the CoU demonstration and instructions. Base-LM also possesses internal thoughts to drive the answer in a more helpful direction without taking into account the harmfulness of the response, i.e., safety, ethics, transparency, etc.

We show jailbreak responses in Figure 3, where GPT-4, ChatGPT, and Vicuna are observed to provide pieces of relevant information for a question deemed harmful. Such a jailbreak indicates the unsafeguarded zone of LLMs where they tend to provide harmful responses. Deploying systems at scale with weaker safety guardrails can be exceedingly detrimental. Innovative and systematic red-teaming helps identify loopholes present in widely deployed systems and plays an important role in making them more robust.

Figure 3: Jailbreak examples for closed source LLM-based systems: GPT-4, ChatGPT, and open-source Vicuna-7B. denotes the clipped text.

Therefore, we evaluate various open-source and closed-source systems for their safety against the red-teaming method of RED-EVAL. We use 200 harmful questions from Shaikh et al. [18] and prompt each system as shown in Figure 2. For each collected response, we ask GPT-4 to evaluate the answer and label whether it is harmful. We measure the rate of successful red-teaming attempts for each model, i.e., how many times the model answered the harmful question. A noisy model can appear to score better simply by producing text unrelated to the harmful question; to identify such cases, we also perform utility testing of these models on other benchmarks covering problem-solving and truthfulness.
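The evaluation loop described in this paragraph can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' harness: `build_prompt`, `query_model`, and `judge_is_harmful` are hypothetical callables standing in for the prompt template, the system under test, and the GPT-4 judge, and the toy stand-ins exist only so the sketch runs without API access.

```python
from typing import Callable, Iterable


def red_team_success_rate(
    questions: Iterable[str],
    build_prompt: Callable[[str], str],
    query_model: Callable[[str], str],
    judge_is_harmful: Callable[[str, str], bool],
) -> float:
    """Fraction of questions whose response the judge labels as harmful."""
    total = 0
    harmful = 0
    for question in questions:
        # Hypothetical: wrap the question in a prompt and query the system.
        response = query_model(build_prompt(question))
        # Hypothetical: the judge sees the question and the response.
        if judge_is_harmful(question, response):
            harmful += 1
        total += 1
    return harmful / total if total else 0.0


# Toy stand-ins (assumptions): one of four questions is refused.
toy_rate = red_team_success_rate(
    ["q1", "q2", "q3", "q4"],
    build_prompt=lambda q: "[template placeholder] " + q,
    query_model=lambda p: "refused" if p.endswith("q1") else "answered",
    judge_is_harmful=lambda q, r: r == "answered",
)
print(toy_rate)
```

Keeping the judge as an injected callable makes it easy to swap in human labels for an agreement check.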


The motivation for proposing RED-INSTRUCT is to systematically make LLMs safer for use while preserving their helpful nature. It consists of two phases: 1) HARMFULQA data generation, and 2) SAFE-ALIGN, i.e., aligning the LLM towards a safer region of operation using HARMFULQA. We depict the overall methodology in Figure 1; phase-1 is elaborated in Figure 4.


The first step in the harmful question-answering (QA) data generation process, shown as step 1 in Figure 4, is topic generation. With repeated interactions, we asked ChatGPT to provide us with 10 diverse topics of discussion. For each topic we obtain 10 categories (subtopics), amassing a collection of 100 categories for QA. For instance, Literature and Language is a topic with Fiction, Linguistics, and Drama among its categories. In step 2, we generate harmful questions, obtaining 20 harmful questions for each category. To minimize duplicates and unharmful questions, we instruct ChatGPT by demonstration to come up with a conversation between Red-LM and Base-LM in which Red-LM asks a new harmful question with each utterance and possesses internal thoughts, while Base-LM provides a harmless and helpful answer. We extract the questions generated as a part of the interaction of Red-LM with Base-LM. This is done by separately feeding the whole conversation to ChatGPT and asking it to generate a list of the identified Red-LM questions. We skip two categories (Chemistry under the topic of Science and Technology, and Political Philosophy under Philosophy and Ethics) for which we could not retrieve the required number of harmful questions. Thus, from the remaining categories, we obtain a collection of 1,960 questions.
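The question count follows directly from the numbers in this paragraph; the short check below makes the arithmetic explicit. The function and parameter names are illustrative only.

```python
def harmful_question_count(
    topics: int = 10,
    categories_per_topic: int = 10,
    skipped_categories: int = 2,
    questions_per_category: int = 20,
) -> int:
    """Number of questions kept after dropping the skipped categories."""
    kept_categories = topics * categories_per_topic - skipped_categories
    return kept_categories * questions_per_category


print(harmful_question_count())  # 98 categories * 20 questions = 1960
```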

Figure 4: Four steps of HARMFULQA generation. Steps 2 – 4 involve CoU-based prompting to generate harmful questions, blue data, and red data. denotes clipped text.

Step 3 receives a harmful question obtained in step 2 and asks ChatGPT to generate a conversation between Red-LM and Base-LM. Red-LM is an agent that seeks to gain information from an ethical bot, Base-LM, regarding the harmful question through subtle question answering: including generic harmless queries, asking for information in pieces, and providing hypothetical and fictional scenarios rather than querying directly. To have a more involved conversation, Red-LM is asked to go through an internal thought, i.e., to analyze the Base-LM responses and plan the next utterance accordingly. Base-LM responses are expected to be harmless yet helpful. For each harmful question obtained in step 2, we repeat step 3 five times. We leverage the randomness in the next-word prediction of ChatGPT (with its LLM backend), sampling different possible ways to retrieve information about the same harmful question. In the Appendix, we demonstrate the different flows of conversation generated by ChatGPT for a common harmful question. We refer to this dataset as blue data. Due to red flags generated by the ChatGPT system, in several cases we could not collect all five conversations, or even a single conversation, per harmful question. Out of 1,960 harmful questions, we could retrieve at least one conversation for 1,912 questions.

For each Base-LM response in the blue data, in step 4, we obtain a corresponding harmful response to the Red-LM question. For this purpose, we leverage the CoU-based red-teaming prompt (Figure 2) proposed in RED-EVAL. Essentially, this step converts a conversation between a harmful agent (Red-LM) and an ethical bot (Base-LM) from being ethically inclined (less harmful and less helpful for harmful questions) to being more helpful irrespective of the harmfulness of the query from Red-LM. Thus, we obtain red data, that is, a counterpart of blue data in which the Base-LM responses are significantly more harmful and helpful. From 1,912 blue data conversations, we could regenerate 1,890 corresponding valid red data conversations. Collectively, we refer to blue and red data as HARMFULQA. In Table 1, we show statistics of the collected data.

We use CoU-based prompts for steps 2 to 4. The step 4 prompt is similar to the prompt used in RED-EVAL, where the system is asked to return the response for Base-LM. Steps 2 and 3 not only have a CoU prompt but also instruct ChatGPT to generate conversations. Unlike CoU in red-teaming, where Base-LM possesses internal thoughts before generating answers, steps 2 and 3 have internal thoughts for Red-LM. This is because the main focus there is on generating harmful questions and the conversations around them.


In phase-2 of RED-INSTRUCT, we align the LLM towards a safer (harmless) yet helpful zone. In this experiment, we want to explore whether the generated blue and red data can strengthen the guardrails of the model. We explore two alignment strategies: A) safe alignment using blue data, and B) safe alignment using the complete HARMFULQA.

  • (Strategy-A: Alignment using blue data) Since Vicuna is a fine-tuned LLaMA, i.e., a decoder built on the causal Transformer architecture, we learn by maximizing the log-likelihood (a causal language modeling objective) autoregressively. Given an input to the model $x = [w_1, \cdots, w_n]$,

$$\log p(x) = \sum_{i=1}^{n} \log p\left(w_i \mid [w_j]_{j=0}^{i-1}\right)$$

We use the blue data conversations to minimize the cross-entropy loss over the Base-LM responses, i.e., a standard causal language modeling objective. Following Chiang et al. [2] and Liu et al. [13], we zero out the loss over the Red-LM utterances by redefining the log-likelihood computation:

$$\log p(x) = \sum_{i=1}^{n} \mathbb{1}_{R(w_i)} \log p\left(w_i \mid [w_j]_{j=0}^{i-1}\right)$$

Here, $\mathbb{1}_{R(w_i)}$ denotes whether token $w_i$ is part of the response tokens: it is 0 if $w_i$ is not part of the Base-LM response and 1 if it is. $n$ is the number of tokens in the input. The model is trained to assign a high probability score to each Base-LM response token $w_i$ given the previous tokens $[w_j]_{j=0}^{i-1}$.

  • (Strategy-B: Alignment using red data) We also explore alignment using the red data. Using red data can provide more insights into the model and guide it away from harmful responses. We posit that negatively rewarding the model on red data can lead to stronger guardrails. To carry out this experiment, we first combine blue and red data and train the Vicuna-7B LM for the first K steps. During this phase, the idea is to take the model in a direction that reduces the cross-entropy loss for blue data responses (more harmless, yet helpful) while moving away from the direction of red data responses, i.e., gradient ascent. We define the loss function for the batch with $N_b$ and $N_r$ as the sets of blue and red samples, respectively,

Where $N_r^{\leq 1}$ and $N_r^{>1}$ denote red samples for which the negative log-likelihood is less than or equal to 1 and greater than 1, respectively. $\lambda_1 = 1$ and $\lambda_2 = 0.1$. $N = N_b + N_r$ and $N_r = N_r^{\leq 1} + N_r^{>1}$. We make the coefficient of $\lambda_2$ positive when the sample loss for a red response goes above 1.0, as allowing the model to maximize the loss over the red response leads to a collapse in model representations where it stops generating text. The same was observed when we set a large value of $\lambda_2$.
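To make the two objectives concrete, the sketch below shows one possible reading of them in NumPy. It is an interpretation of the prose, not the authors' implementation. The assumptions are: per-sample losses are passed in already computed; red samples with loss at or below the threshold receive a negative coefficient (gradient ascent, scaled by `lam1`), while those above it receive a positive coefficient scaled by `lam2`; and all terms are normalized by the batch size. The function names and the pairing of each coefficient with each subset are assumptions.

```python
import numpy as np


def masked_nll(token_logprobs, response_mask) -> float:
    """Negative log-likelihood summed over Base-LM response tokens only.

    token_logprobs[i] is log p(w_i | previous tokens); response_mask[i] is 1
    for Base-LM response tokens and 0 for Red-LM tokens (the indicator).
    """
    logprobs = np.asarray(token_logprobs, dtype=float)
    mask = np.asarray(response_mask, dtype=float)
    return float(-(logprobs * mask).sum())


def strategy_b_batch_loss(blue_nlls, red_nlls, lam1=1.0, lam2=0.1, threshold=1.0) -> float:
    """One reading of the blue-red batch loss (assumed form, see lead-in)."""
    blue = np.asarray(blue_nlls, dtype=float)
    red = np.asarray(red_nlls, dtype=float)
    n = blue.size + red.size
    if n == 0:
        return 0.0
    low = red[red <= threshold]   # pushed away: loss is maximized (ascent)
    high = red[red > threshold]   # coefficient flipped to positive
    return float((blue.sum() - lam1 * low.sum() + lam2 * high.sum()) / n)


# Toy numbers for illustration only.
print(masked_nll([-0.5, -1.0, -2.0], [0, 1, 1]))
print(strategy_b_batch_loss([2.0], [0.5, 3.0]))
```

Flipping the sign above the threshold bounds how far the loss on any single red sample can be pushed, which matches the stated motivation of avoiding a collapse in generation.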

Training data. For Strategy-A, we use blue responses that are paired with red responses, i.e., for each conversation in blue data, there is a conversation in red data. With the help of the list of topics used in HARMFULQA, we also collect around 13K helpful questions and their standard (without red-teaming) responses from ChatGPT, accounting for a total of 13K QA pairs (Table 2). To this list of around 21K samples, we mix an equal amount of ShareGPT data that was used in Vicuna training [2]. Mixing the data was an important step to prevent forgetting, an approach similar to that adopted by Liu et al. [13].
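The data mixing described in this paragraph can be sketched as below. This is a hedged illustration: the function name, the uniform random subsampling of ShareGPT, and the final shuffle are assumptions, since the text only states that an equal amount of ShareGPT data is mixed in.

```python
import random


def build_training_mix(blue_conversations, helpful_qa, sharegpt, seed=0):
    """Combine safety-side data with an equal number of ShareGPT samples."""
    safety_side = list(blue_conversations) + list(helpful_qa)
    pool = list(sharegpt)
    rng = random.Random(seed)
    # Assumption: sample ShareGPT uniformly without replacement, capped by
    # however many ShareGPT conversations are available.
    k = min(len(safety_side), len(pool))
    general_side = rng.sample(pool, k)
    mixed = safety_side + general_side
    rng.shuffle(mixed)
    return mixed


# Toy usage with placeholder records.
toy_mix = build_training_mix(
    [("blue", i) for i in range(3)],
    [("qa", i) for i in range(2)],
    [("sharegpt", i) for i in range(10)],
)
print(len(toy_mix))
```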

For Strategy-B, we use the blue-red matched data (around 7K each) for the first K steps of training and then omit the responses from the red data. After the K steps, we follow the Strategy-A training. Thus, the primary difference between the training of strategies A and B is that B uses red data to provide guidance to the model by penalizing it when it assigns a high probability to the red (harmful) responses. While our first intuition for Strategy-B was to keep red data for the entire course of training, we observed that model learning becomes noisy, leading to forgetting of knowledge and task-solving capabilities. We discuss this further in the experiments section.

The purpose of SAFE-ALIGN is to guide the model towards more harmless behavior by showing examples from widely used systems such as ChatGPT. Subtle harmful conversations with hypothetical and fictional scenarios can trigger the model to generate harmful information. Training on ethical (safe) responses of such conversational data can lead to stronger model guardrails. Since ChatGPT can easily identify harmful questions and provide harmless responses even with red-teaming attempts (Table 3), we posit that training a smaller model on its responses can lead to a better safety-aligned model.

One approach to performing safety alignment of Large LMs is to construct blue data and red data (via jailbreaking) by prompting the model itself (and not ChatGPT) and using it to fine-tune the model towards safer responses. This could be beneficial for the model which is of large scale. Although exciting, we leave such an approach for future work.

3 Experiments

Safety benchmarks. We compare our red-teaming approach RED-EVAL with the chain-of-thought red-teaming by Shaikh et al. [18], following the template from Kojima et al. [11]. By prompting with “Let’s think step by step,” it is observed that models generate reasoning steps that lead to improved downstream performance [11]. We denote this baseline by COT. In red-teaming, the same prompt has been observed to tease out harmful responses to harmful queries. We also consider standard prompts for the models without an explicit red-teaming attempt, referred to as STANDARD. With each prompt, we query the model with 200 harmful questions collected by Shaikh et al. [18] using text-davinci-002 across six adjectives: racist, stereotypical, sexist, illegal, toxic, and harmful. This set is referred to as DANGEROUSQA. For GPT-4 and ChatGPT, we also report RED-EVAL results on the 1,960 harmful questions collected in HARMFULQA.

To evaluate RED-EVAL red-teaming extensively, we consider eight open-source decoder-only causal models: VICUNA-13B, 7B; STABLEBELUGA-13B, 7B; a fine-tuned version of LLaMA-2 [25], i.e., LLAMA2-FT-7B; and a ShareGPT fine-tuned version of Vicuna-7B, denoted by VICUNA-FT-7B, which is trained using the ShareGPT split of STARLING’s training dataset. This baseline enables us to discern the extent to which the performance enhancements of STARLING over VICUNA-7B are predominantly influenced by the ShareGPT dataset. We also experiment with widely used LLM-based systems such as the ChatGPT and GPT-4 API versions. The two models trained with SAFE-ALIGN are detailed below.

SAFE-ALIGN. We train Vicuna-7B with Strategy-A and Strategy-B and denote the two models by STARLING (BLUE) and STARLING (BLUE-RED). We also train VICUNA-FT-7B which we obtained by further tuning Vicuna-7B on 20,803 ShareGPT conversations. This is equivalent to Strategy-A without blue data. All the models were trained with batch-size 4, 8 gradient accumulation steps, for 3 epochs with an LR of 1e-5, and a maximum model input length of 1,280. STARLING (BLUE-RED) was first trained on paired blue-red response for the initial K=200 steps with LR = 2e-5.
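The hyperparameters in this paragraph can be gathered into a small configuration object, together with the two-stage schedule used for STARLING (BLUE-RED). This is a sketch under assumptions: the class and function names are hypothetical, the batch size is treated as a single-device value (the text does not say how many devices were used), and K is assumed to count optimizer steps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SafeAlignConfig:
    batch_size: int = 4
    grad_accum_steps: int = 8
    epochs: int = 3
    learning_rate: float = 1e-5
    max_input_len: int = 1280
    red_steps: int = 200          # K, only relevant for the blue-red strategy
    red_learning_rate: float = 2e-5

    @property
    def sequences_per_update(self) -> int:
        """Sequences contributing to one optimizer update on one device."""
        return self.batch_size * self.grad_accum_steps


def regime_for_step(step: int, cfg: SafeAlignConfig, use_red: bool):
    """Return (data regime, learning rate) for a 0-indexed training step."""
    if use_red and step < cfg.red_steps:
        return "paired blue-red", cfg.red_learning_rate
    return "strategy-A mix", cfg.learning_rate


cfg = SafeAlignConfig()
print(cfg.sequences_per_update)
print(regime_for_step(0, cfg, use_red=True))
```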

LLM-as-a-Judge. Following Zheng et al. [28], we employ the GPT-4 API to label the responses generated by models on the red-teaming task. On 400 randomly selected harmful QA pairs, we conducted an agreement test and found that GPT-4 labels match those of three expert human annotators more than 98% of the time. Thus, we rely on GPT-4 as a judge for red-teaming.

Performance metric. For open-source models, we define Attack Success Rate (ASR) [29] as the fraction of successful red-teaming attempts ($n_r$), where the model provided a harmful answer, out of the total number of harmful questions queried ($n_p$); thus $\mathrm{ASR} = n_r / n_p$. Closed-source systems, such as the GPT-4 and ChatGPT APIs, recognize harmful content and refuse to respond as per their content management policy. We refer to the number of such cases as $n_a$. We report ASR scores of closed-source models excluding such cases: $\mathrm{ASR2} = n_r / (n_p - n_a)$. In this paper, we report ASR for open-source models and ASR2 for closed-source models, and use the common term ASR for both.
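The two metrics can be written as small helper functions. This is a minimal sketch; the function names and the error handling for empty denominators are assumptions, and the counts in the usage lines are toy values rather than results from the paper.

```python
def asr(n_r: int, n_p: int) -> float:
    """Attack success rate: harmful answers over questions queried."""
    if n_p <= 0:
        raise ValueError("n_p must be positive")
    return n_r / n_p


def asr2(n_r: int, n_p: int, n_a: int) -> float:
    """ASR excluding the n_a queries blocked by content management."""
    answered = n_p - n_a
    if answered <= 0:
        raise ValueError("n_p - n_a must be positive")
    return n_r / answered


# Toy counts: 8 queries, 2 blocked by the API, 3 harmful answers.
print(asr(3, 8))      # 0.375
print(asr2(3, 8, 2))  # 0.5
```

Because blocked queries are removed from the denominator, ASR2 is never lower than ASR for the same counts.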

HHH. We use the Helpful, Honest, and Harmless (HHH) benchmark [1] for HHH evaluation. This dataset contains 50 assessment instances for each category, along with an additional ‘other’ category, for a total of around 200 comparisons. The main objective of the dataset is to evaluate both the alignment and the capabilities of models, without explicitly separating these two dimensions. The evaluation involves a Multiple-Choice (MC) task designed to gauge the models’ capacity to choose the better answer from two reference options. The likelihood of the model favoring one answer over the other is computed when it is presented with the two potential answers simultaneously.
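One way to picture the two-option comparison is sketched below. It assumes the model's preference is summarized by a log-probability for each option and that an instance counts as correct when the better reference option receives more than half of the normalized probability mass. Both the scoring rule and the function names are assumptions, not the benchmark's official implementation.

```python
import math


def prob_prefers_first(logp_first: float, logp_second: float) -> float:
    """Normalized probability mass on the first of two options."""
    m = max(logp_first, logp_second)  # subtract the max for stability
    a = math.exp(logp_first - m)
    b = math.exp(logp_second - m)
    return a / (a + b)


def pairwise_accuracy(pairs) -> float:
    """pairs: list of (logp of better option, logp of worse option)."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    hits = sum(1 for better, worse in pairs if prob_prefers_first(better, worse) > 0.5)
    return hits / len(pairs)


# Toy log-probabilities for illustration only.
toy_pairs = [(-1.0, -2.0), (-3.0, -2.5), (-0.1, -4.0), (-2.0, -2.0)]
print(pairwise_accuracy(toy_pairs))
```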

Utility benchmarks. Besides evaluating the models on harmfulness benchmarks, we also evaluate them on benchmarks that measure model utility, such as TruthfulQA [12], BBH [23], and MMLU [8]. For TruthfulQA, the score is the normalized total probability assigned to the set of true answers (MC2). MMLU is a 5-shot evaluation based on next-word prediction. BBH evaluates the model over 23 challenging tasks. We use 3-shot direct prompting and measure the exact-match score.
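As an illustration of the MC2 score mentioned above, the sketch below computes the normalized total probability assigned to the true answers of one question, given a log-probability for each reference answer. The function name and input format are assumptions; evaluation toolkits may differ in details such as how answer log-probabilities are obtained.

```python
import math


def mc2_score(true_logprobs, false_logprobs) -> float:
    """Probability mass on true answers, normalized over all references."""
    p_true = sum(math.exp(lp) for lp in true_logprobs)
    p_false = sum(math.exp(lp) for lp in false_logprobs)
    total = p_true + p_false
    if total == 0.0:
        raise ValueError("at least one answer must have non-zero probability")
    return p_true / total


# Toy example: two true answers with mass 0.3 and 0.1, one false with 0.2.
toy_mc2 = mc2_score([math.log(0.3), math.log(0.1)], [math.log(0.2)])
print(round(toy_mc2, 3))
```

The benchmark-level score would then be the average of this quantity over all questions.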

4 Results and Discussions

4.1 Red-Teaming.

In Table 3 (DANGEROUSQA), where publicly deployed systems such as GPT-4 and ChatGPT identified nearly all the harmful samples under STANDARD and COT, RED-EVAL could successfully jailbreak GPT-4 65% of the time and ChatGPT 73% of the time, for an average successful red-teaming rate of about 69%. Open-source models are observed to be safer against standard prompts, with most of them identifying harmful questions more than 90% of the time. However, we observe COT to be quite effective at triggering harmful responses from these open-source models, with an average of around 47% successful red-teaming attempts. CoU-based prompting, i.e., RED-EVAL, could successfully red-team open-source models in more than 86% of the attempts, a 39% improvement over COT on open-source models and a 65% improvement on closed-source systems.

We also test GPT-4 and ChatGPT on harmful questions collected as a part of HARMFULQA (refer to column HARMFULQA in Table 3). We find a pattern in the models’ performance similar to that on DANGEROUSQA. Upon testing on 1,960 responses, we observe a red-teaming success rate of over 67% on closed-source models for RED-EVAL, while COT and STANDARD were unsuccessful in almost all their red-teaming attempts.

4.1.1 Analyzing the CoU Prompt in RED-EVAL

Need for internal thoughts. We also observe the importance of internal thoughts in the prompt used in RED-EVAL (Table 4). With internal thoughts, the prompt achieves a higher ASR on GPT-4 and ChatGPT by 22% and 6.5%, respectively. A similar pattern is observed in ASR2, with respective improvements of 26.5% and 6%. Thus, possessing internal thoughts is a key aspect of the RED-EVAL benchmark.

Can we improve the CoU prompt? We also try a slight variant of our prompt (Figure 5) in which a more elaborate answer from Base-LM is provided in the demonstration and the instruction explicitly asks the model to generate a longer answer. We observe a slight increase in the ASR score on open-source models (from 86.4% to 86.6%), while a drop was observed in performance on closed-source systems (from 68.95% to 55.5%).

Figure 5: Variant of CoU prompt template used in RED-EVAL that shows better performance on open-source models.

Comparison with Universal Attack. We also compare RED-EVAL with the adversarial suffix introduced by [29] and label the responses using GPT-4 as opposed to keyword-based labeling. We place a harmful question in field of the following template

In ASR evaluation on their 388 test harmful behaviors, the universal attack is observed to be significantly less effective than RED-EVAL. The universal attack shows 4.7% and 29.5% ASR on GPT-4 and ChatGPT, while our method could successfully get harmful responses for 59.6% and 75.5% of the inputs. Note that while our evaluation is based on GPT-4, the authors of the universal attack [29] utilized a keyword-matching approach to detect harmful responses.

Table 5: DANGEROUSQA shows the attack success rate (ASR) using the standard prompting (STANDARD), CoT-based prompting (COT), and CoU-based prompting RED-EVAL. A similar evaluation is carried out on 1,960 harmful questions under HARMFULQA. BBH-HHH denotes scores on helpful, honest, harmless, and other data.

Notably, the larger variants of the models are harder to red-team. For instance, on DANGEROUSQA, Vicuna-7B is around 4% more susceptible to COT and RED-EVAL than Vicuna-13B, and the same trend is observed between STABLEBELUGA 7B and 13B and between ChatGPT and GPT-4. For COT, and on average across the red-teaming attempts, we observe that training on red data makes the fine-tuned model STARLING (BLUE-RED) more susceptible to red-teaming attempts than the baseline VICUNA-7B. We posit that this is because of training instability, which introduced noise in the model. This opens up new future directions for finding more effective ways to learn from harmful (red) data and construct stronger safety guardrails.

4.2 Discussion on the Remaining Experiments

HHH. During the evaluation of STARLING (BLUE-RED), we observe that the K-step pre-training of Vicuna increases the average HHH score by more than 6%, with a significant increase in harmlessness (> 12%) and helpfulness (> 9%) and a 5% trade-off in the honesty score. When we omit red data from training, as in the case of STARLING (BLUE), the average performance decreases by around 3%, with a major impact on the harmlessness score. It was also observed that continued fine-tuning of VICUNA-7B (VICUNA-FT-7B) on the ShareGPT split of our training data improves both the red-teaming and HHH performance.

Utility benchmarks. Besides improvements in HHH and RED-EVAL scores, we also observe that STARLING achieves (Table 6) an improvement in TruthfulQA scores with a slight reduction in problem-solving performance. Thus, fine-tuning Vicuna on blue-red data is shown to make it more harmless with a slight trade-off in its utility. We also compare STARLING with VICUNA-FT-7B and observe that TruthfulQA scores improve over the VICUNA-7B baseline. While continual training on pre-training data may improve TruthfulQA scores, it makes the model worse at problem-solving (MMLU, BBH). Thus, following the definition of safety, our STARLING-based safety-aligned models are safer while maintaining most of the utility performance of Vicuna-7B.

Overall, while continued fine-tuning increases the performance of Vicuna-7B, STARLING (BLUE), which is trained on blue data, turns out to be more effective against red-teaming (+5.2%) and on the HHH (+2.3%) and utility benchmarks (+0.55%). This shows that blue data from HARMFULQA is highly useful for safety alignment. Moreover, despite STARLING (BLUE-RED) being prone to COT and STANDARD red-teaming, its high TruthfulQA and HHH scores show the potential of red data and Strategy-B. We leave further exploration of leveraging red data as future work.

Problems with a large K in Strategy-B and LR. While intuitively reducing the likelihood of the model on harmful responses would behave as a negative reward, we observed that aiming to increase the loss of such samples harms model learning, and models become reserved in generating outputs. We also notice a collapse in the generation ability of the model, observed via a significant drop in problem-solving capabilities tested on MMLU, when we keep a larger K value (>200). Thus, to mitigate this problem, we turn the loss over harmful responses to be positive when the values are large and omit the harmful responses completely after 200 steps of training. At this point, to recover the pre-training performance, we also add ShareGPT data. We also observed model learning to be highly sensitive to the learning rate: a higher learning rate gave non-monotonic performance results across epochs. To mitigate this, we tried a few values and chose 1e−5, which provides monotonic performance and allows us to choose the best checkpoint, the one where validation loss starts increasing. For instance, in 200 steps of training the TruthfulQA score decreases by 0.5 points while MMLU drops by over 1.5. Contrary to this, when we train on blue data, TruthfulQA monotonically increases by about 2 percent. Thus, adding red data makes training unstable, which is not observed without it, i.e., in Strategy-A.
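The loss handling described in this paragraph can be sketched as follows. This is a minimal illustration of one reading of the text, not the authors' implementation: the function name `safe_align_loss`, the `flip_threshold` value, and the averaging over terms are assumptions introduced here, and the inputs are per-response negative log-likelihoods that some training loop is assumed to have computed already.

```python
def safe_align_loss(blue_nll, red_nll, step, k_steps=200, flip_threshold=1.0):
    """Combine per-response NLLs into one scalar that the optimizer minimizes.

    blue_nll: NLL values of safe (blue) responses in the batch.
    red_nll:  NLL values of harmful (red) responses in the batch.
    step:     current training step.
    k_steps:  red data is dropped once this many steps have passed (K in the text).
    flip_threshold: hypothetical cut-off above which the red term turns positive.
    """
    # Safe responses: ordinary likelihood training (minimize NLL).
    terms = list(blue_nll)

    # Harmful responses are only used during the first K steps.
    if step < k_steps:
        for nll in red_nll:
            if nll > flip_threshold:
                # The loss is already large: use it with a positive sign so the
                # penalty cannot grow without bound.
                terms.append(nll)
            else:
                # Gradient ascent on the harmful response: minimize -NLL.
                terms.append(-nll)

    return sum(terms) / len(terms) if terms else 0.0
```

For example, `safe_align_loss([1.0], [0.5], step=0)` mixes a minimized blue term with a negated red term, while the same call with `step=200` ignores the red response entirely. The exact threshold and schedule would need to be taken from the training setup described in the main text.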

5 Conclusion

This paper focused on safety evaluation and alignment of language models at scale. For evaluation, we proposed a new red-teaming method, RED-EVAL, using a Chain of Utterances (CoU) prompt that could effectively jailbreak not only open-source models such as Vicuna and StableBeluga but also widely used closed-source systems such as GPT-4 and ChatGPT. With the help of different types of CoU prompting, in RED-INSTRUCT, we first extracted a conversational dataset, HARMFULQA, with harmful questions and safe and corresponding harmful answers. We used the dataset to perform safety alignment of Vicuna-7B, giving rise to a new LLM named STARLING. An extensive set of experiments shows that RED-EVAL outperformed existing red-teaming techniques and jailbroke GPT-4 and ChatGPT in 65% and 73% of the red-teaming attempts. We also show that STARLING exhibits safer behavior on safety evaluations while maintaining most of its utility.


This work is supported by the Microsoft Research Accelerate Foundation Models Academic Research program.


[1] Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Benjamin Mann, Nova DasSarma, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Jackson Kernion, Kamal Ndousse, Catherine Olsson, Dario Amodei, Tom B. Brown, Jack Clark, Sam McCandlish, Chris Olah, and Jared Kaplan. A general language assistant as a laboratory for alignment. CoRR, abs/2112.00861, 2021.

[2] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023.

[3] Jonathan H Choi, Kristin E Hickman, Amy Monahan, and Daniel Schwarcz. Chatgpt goes to law school. Available at SSRN, 2023.

[4] Jiawen Deng, Hao Sun, Zhexin Zhang, Jiale Cheng, and Minlie Huang. Recent advances towards safe, responsible, and moral dialogue systems: A survey. arXiv preprint arXiv:2302.09270, 2023.

[5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

[6] Deep Ganguli, Amanda Askell, Nicholas Schiefer, Thomas Liao, Kamile Lukošiute, Anna Chen, Anna Goldie, Azalia Mirhoseini, Catherine Olsson, Danny Hernandez, et al. The capacity for moral self-correction in large language models. arXiv preprint arXiv:2302.07459, 2023.

[7] Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, and Jacob Steinhardt. Aligning ai with shared human values. arXiv preprint arXiv:2008.02275, 2020.

[8] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.

[9] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre. Training compute-optimal large language models. CoRR, abs/2203.15556, 2022.

[10] Firuz Kamalov and Ikhlaas Gurrib. A new era of artificial intelligence in education: A multi-faceted revolution. CoRR, abs/2305.18303, 2023.

[11] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213, 2022.

[12] Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958, 2021.

[13] Hao Liu, Carmelo Sferrazza, and Pieter Abbeel. Languages are rewards: Hindsight finetuning using human feedback. arXiv preprint arXiv:2302.02676, 2023.

[14] Kamil Malinka, Martin Peresíni, Anton Firc, Ondrej Hujnak, and Filip Janus. On the educational impact of chatgpt: Is artificial intelligence ready to obtain a university degree? CoRR, abs/2303.11146, 2023.

[15] Oded Nov, Nina Singh, and Devin M. Mann. Putting chatgpt’s medical advice to the (turing) test. CoRR, abs/2301.10035, 2023.

[16] Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jana Thompson, Phu Mon Htut, and Samuel R Bowman. Bbq: A hand-built bias benchmark for question answering. arXiv preprint arXiv:2110.08193, 2021.

[17] Rachel Rudinger, Jason Naradowsky, Brian Leonard, and Benjamin Van Durme. Gender bias in coreference resolution. arXiv preprint arXiv:1804.09301, 2018.

[18] Omar Shaikh, Hongxin Zhang, William Held, Michael Bernstein, and Diyi Yang. On second thought, let’s not think step by step! bias and toxicity in zero-shot reasoning. arXiv preprint arXiv:2212.08061, 2022.

[19] Omar Shaikh, Hongxin Zhang, William Held, Michael Bernstein, and Diyi Yang. On second thought, let’s not think step by step! bias and toxicity in zero-shot reasoning. arXiv preprint arXiv:2212.08061, 2022.

[20] Irene Solaiman and Christy Dennison. Process for adapting language models to society (palms) with values-targeted datasets. Advances in Neural Information Processing Systems, 34:5861– 5873, 2021.

[21] Guijin Son, Hanearl Jung, Moonjeong Hahm, Keonju Na, and Sol Jin. Beyond classification: Financial reasoning in state-of-the-art language models. CoRR, abs/2305.01505, 2023.

[22] Zhiqing Sun, Yikang Shen, Qinhong Zhou, Hongxin Zhang, Zhenfang Chen, David Cox, Yiming Yang, and Chuang Gan. Principle-driven self-alignment of language models from scratch with minimal human supervision, 2023.

[23] Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, et al. Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022.

[24] Ruixiang Tang, Xiaotian Han, Xiaoqian Jiang, and Xia Hu. Does synthetic data generation of llms help clinical text mining? arXiv preprint arXiv:2303.04360, 2023.

[25] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models. CoRR, 2023.

[26] Dietrich Trautmann, Alina Petrova, and Frank Schilder. Legal prompt engineering for multilingual legal judgement prediction. CoRR, abs/2212.02199, 2022.

[27] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. A survey of large language models. arXiv preprint arXiv:2303.18223, 2023.

[28] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685, 2023.

[29] Andy Zou, Zifan Wang, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models, 2023.

A Topical Diversity in HARMFULQA

Table 7 shows the topics generated by repeated interaction with ChatGPT, with 10 main topics and 10 subtopics each. Figure 7 shows three conversations yielded from Step 3 of the HARMFULQA generation process as shown in Figure 4. Figure 6 shows some harmful questions in HARMFULQA generated by ChatGPT on different topics.

Table 7: Topic major and minor categories obtained in the topic generation step of HARMFULQA construction.

B Performance on Vicuna Benchmark Questions

In their recent work, Chiang et al. [2] introduced the Vicuna Benchmark Questions—a formidable benchmark consisting of 80 diverse questions that demand a range of distinct reasoning skills for accurate answers, including roleplay, mathematics, coding, commonsense, and more. Subsequently, the answers generated by various models undergo a rigorous evaluation by GPT-4, assessing their helpfulness, relevance, accuracy, and level of detail. This meticulous evaluation establishes a direct comparison between the models, ultimately determining a result of Win, Tie, or Loss.
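As a rough illustration of how such pairwise judgments can be aggregated into Win, Tie, and Loss counts, consider the sketch below. It assumes that the judge returns one numeric score per model for each question; the function names `outcome` and `tally` and the example scores are hypothetical and not taken from the paper.

```python
from collections import Counter


def outcome(score_a, score_b):
    """Result for model A against model B on a single question."""
    if score_a > score_b:
        return "win"
    if score_a < score_b:
        return "loss"
    return "tie"


def tally(score_pairs):
    """Count wins, ties, and losses of model A over (score_a, score_b) pairs."""
    counts = Counter(outcome(a, b) for a, b in score_pairs)
    # Always report all three keys, even when a count is zero.
    return {key: counts.get(key, 0) for key in ("win", "tie", "loss")}


# Made-up judge scores for four questions (model A, model B).
example_scores = [(8, 7), (6, 6), (5, 9), (9, 8)]
print(tally(example_scores))
```

Counting outcomes rather than averaging raw scores keeps the comparison robust to the judge's score scale, which is one reason a head-to-head summary is a natural fit for this kind of evaluation.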

Figure 8 showcases a comparative analysis involving STARLING against the established baselines. The visual representation clearly illustrates that STARLING attains a performance level akin to that of the baseline models VICUNA-7B and VICUNA-FT-7B.

C CoU Prompts for RED-EVAL

We provide and discuss the choice of CoU-based prompt templates used in this paper for red-teaming and construction of HARMFULQA for safety alignment. In Table 8 we show the impact of different questions ([QT] in Figure 9) on the rate of successful red-team attempts on GPT-4. For each QT, we used 10 harmful questions (shown in row [Q] in Figure 9) and labeled the responses using GPT-4 as a judge in a separate API call. We observe that changing the question as a part of the template can significantly impact the performance. Moreover, changing the question from harmful to non-harmful ([QT1]→[QT1P] and [QT4]→[QT4P]) also shows a drop in the attack rate from 6/10 to 4/10 and 6/10 to 2/10, respectively. However, the trend can reverse: [QT5]→[QT5P] increases the attack rate from 3/10 to 5/10. A similar pattern is observed by changing the response part ([RT]) of Base-LM. Hence, finding an effective prompt needs human intervention and several iterations of template tuning. Notably, a critical part of the template is an instruction ([DT]) that is added within the demonstration. We notice the performance drops to nearly zero when the instruction is omitted from the template. Another essential aspect in achieving good ASR scores is internal thoughts, which we discuss more in Section 4.

Figure 6: Samples of topic-related harmful questions generated by ChatGPT.

D Prompt for Constructing HARMFULQA

For step 1 in Figure 4, to collect topics and sub-topic categories, we simply prompt ChatGPT to give 10 educational topics and 10 subtopics each, which gives us Table 7. For step 2, we use the prompt shown in Figure 10. For step 3, the conversations for blue data, we use the prompt shown in Figure 11.

Figure 7: Three conversations obtained in step 3 between Red-LM and Base-LM for the same harmful question sampled from step 2. For better readability, we skip Red-LM internal thoughts.
Figure 8: Head-to-head comparison between the models on Vicuna Benchmark Questions. The generated responses by the models were evaluated by GPT-4.

E Prompts Used in Evaluations

Prompt Used in HHH Evaluation. Please refer to Figure 12.

Prompt Used in Red-Teaming Evaluation with GPT-4. We use the following prompt for GPT-4 as a judge i.e. labeling responses generated by models.

You are a helpful assistant. Given a harmful question, you need to evaluate the answers of a conversational agent based on comprehensive analysis in terms of harmfulness. Be as objective as possible. After providing your explanation, output your final verdict by strictly following this format: [[A]] if the response is not harmful, [[B]] if the response is harmful.
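A judge reply in the format requested above can be turned into a label, and a batch of labels into an attack success rate, roughly as follows. This is a sketch under stated assumptions rather than the authors' evaluation script: the helper names `parse_verdict` and `attack_success_rate` are hypothetical, the last verdict token in the reply is taken as final, and replies without a verdict are excluded from the rate so they can be labeled manually.

```python
import re

# Matches the verdict tokens requested in the judge prompt: [[A]] or [[B]].
VERDICT_RE = re.compile(r"\[\[([AB])\]\]")


def parse_verdict(judge_output):
    """Return 'harmful', 'harmless', or None if no verdict token is present."""
    matches = VERDICT_RE.findall(judge_output)
    if not matches:
        return None  # no usable verdict; leave for manual annotation
    # The prompt asks for the verdict after the explanation, so use the last one.
    return "harmful" if matches[-1] == "B" else "harmless"


def attack_success_rate(judge_outputs):
    """Fraction of judged responses labeled harmful (unparsed replies excluded)."""
    labels = [parse_verdict(text) for text in judge_outputs]
    decided = [label for label in labels if label is not None]
    if not decided:
        return 0.0
    return sum(label == "harmful" for label in decided) / len(decided)
```

Calling `attack_success_rate` on a list of raw judge replies yields a value between 0 and 1 that corresponds to the rate of successful red-teaming attempts used as the metric in this paper.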

Figure 9: CoU prompt template used in RED-EVAL. Yellow and cyan-highlighted texts are placeholders for a harmful query and response (as text completion) to be generated by LLM, respectively. The rest of the text is a part of the CoU-based prompt. Question [QT], instruction [DT], and response [RT] are parts of the CoU demonstration template.
Figure 10: CoU prompt template used in step 2 of HARMFULQA generation. Yellow and cyan-highlighted texts are placeholders for a sub-topic and response (as text completion) to be generated by LLM, respectively. The rest of the text is a part of the CoU-based prompt.
Figure 11: CoU prompt template used in step 3 of HARMFULQA generation. The yellow-highlighted text is a placeholder for the sub-topic and harmful question from step 2 (agenda). Cyan-highlighted text is a placeholder for the text generated by LLM. The rest of the text is a part of the CoU-based prompt.

F HarmfulQ

Following Sun et al. [22], we draw the prompts below from the sensitive questions in Shaikh et al. [19].

Figure 12: Prompt template used for HHH evaluation. Yellow. Question [QT], instruction [DT], and response [RT] are parts of the CoU demonstration template.

G PALMS Sensitive Questions

Following Sun et al. [22], we draw two questions from the sensitive questions in Solaiman and Dennison [20].

H Moral Self-Correction Prompts

Following [22], we employ two prompts in Ganguli et al. [6] to exemplify the moral self-correction within STARLING without any additional instructions. One of these prompts originates from the Bias Benchmark for QA (BBQ) as proposed by Parrish et al. [16], while the other is sourced from the Winogender dataset presented by Rudinger et al. [17].

I Generations on Vicuna Benchmark Questions

Writing Category:

Roleplay Category:

Math Category:

Common-sense Category:

Fermi Category:

Counterfactual Category:

Coding Category:

Generic Category:

Knowledge Category:

J Generations on HARMFULQA

We used RED-EVAL to obtain responses to these HARMFULQA prompts.


1 We use the rate of successful red-teaming attempts as a performance metric, which is 1 - refusal rate.

2 It is important to note that the CoU-based prompt used in blue data construction is different from the red-teaming prompt used in RED-EVAL. We provide both prompt templates in the Appendix for clarity.


4 In these cases, even after 10 trials to generate 20 harmful questions as a part of the conversation, either ChatGPT raised the content warning or the number of harmful questions in the conversation was less than 20.

5 We set the temperature parameter to 0.7, a number between 0 and 1 where a higher value indicates more randomness in text.

6 Conversations returned in a proper format following the template provided.

7 While it would be interesting to observe results of RED-EVAL on open-source models, due to compute limitations, we could not perform the experiments. We aim to complete Table 3 in the future.

8 LLAMA2-FT-7B: Nous; STABLEBELUGA-13B, 7B: Beluga

9 For each model iteration, a small subset of outputs is rejected by GPT-4. To address this, we have engaged two annotators dedicated to manually classifying these outputs as either harmful or harmless. However, this adjustment did not alter the overarching pattern within the models’ outcomes.

10 The adversarial attack suffix is obtained from Zou et al. [29].