Skip to main content
Uncategorized

GPT-4 IS TOO SMART TO BE SAFE: STEALTHY CHATWITH LLMS VIA CIPHER

Aug 12, 2023

WARNING: THIS PAPER CONTAINS UNSAFE MODEL RESPONSES.

Youliang Yuan1,2∗ Wenxiang Jiao2 Wenxuan Wang2,3∗ Jen-tse Huang 2,3∗ Pinjia He1† Shuming Shi2 Zhaopeng Tu2

1School of Data Science, The Chinese University of Hong Kong, Shenzhen, China
2Tencent AI Lab 3The Chinese University of Hong Kong
1youliangyuan@link.cuhk.edu.cn,hepinjia@cuhk.edu.cn
2{joelwxjiao,shumingshi,zptu}@tencent.com
3{wxwang,jthuang}@cse.cuhk.edu.hk

Text Box: 12 Aug 2023
Figure 1: Engaging in conversations with ChatGPT using ciphers can lead to unsafe behaviors.

ABSTRACT

Safety lies at the core of the development of Large Language Models (LLMs). There is ample work on aligning LLMs with human ethics and preferences, includ-ingdatafilteringinpretraining,supervisedfine-tuning,reinforcementlearningfrom human feedback, and red teaming, etc. In this study, we discover that chat in cipher can bypass the safety alignment techniques of LLMs, which are mainly conducted in natural languages. We propose a novel framework CipherChat to systematically examine the generalizability of safety alignment to non-natural languages – ciphers. CipherChat enables humans to chat with LLMs through cipher prompts topped with system role descriptions and few-shot enciphered demonstrations. We use CipherChat to assess state-of-the-art LLMs, including ChatGPT and GPT-4 for different representative human ciphers across 11 safety domains in both English and Chinese. Experimental results show that certain ciphers succeed almost 100% of the time to bypass the safety alignment of GPT-4 in several safety domains, demonstrating the necessity of developing safety alignment for non-natural languages. Notably, we identify that LLMs seem to have a “secret cipher”, and propose a novel SelfCipher that uses only role play and several demonstrations in natural language to evoke this capability. SelfCipher surprisingly outperforms existing human ciphers in almost all cases.1

1 INTRODUCTION

The emergence of Large Language Models (LLMs) has played a pivotal role in driving the advancement of Artificial Intelligence (AI) systems. Noteworthy LLMs like ChatGPT (OpenAI, 2023a;b), Claude2 (Anthropic, 2023), Bard (Google, 2023), and Llama2 (Touvron et al., 2023a) have demonstrated their advanced capability to perform innovative applications, ranging from tool utilization, supplementing human evaluations, to stimulating human interactive behaviors (Jiao et al., 2023; Bubeck et al., 2023; Schick et al., 2023; Chiang & Lee, 2023; Park et al., 2023). The outstanding competencies have fueled their widespread deployment, while the progression is shadowed by a significant challenge: ensuring the safety and reliability of the responses.

To harden LLMs for safety, there has been a great body of work for aligning LLMs with human ethics and preferences to ensure their responsible and effective deployment, including data filtering (Xu et al., 2020; Welbl et al., 2021; Wang et al., 2022), supervised fine-tuning (Ouyang et al., 2022), reinforcement learning from human feedback (RLHF) (Christiano et al., 2017), and red teaming (Perez et al., 2022; Ganguli et al., 2022; OpenAI, 2023b). The majority of existing work on safety alignment has focused on the inputs and outputs in natural languages. However, recent works show that LLMs exhibit unexpected capabilities in understanding non-natural languages like the Morse Code (Barak, 2023), ROT13, and Base64 (Wei et al., 2023a). One research question naturally arises: can the non-natural language prompt bypass the safety alignment mainly in natural language?

To answer this question, we propose a novel framework CipherChat to systematically examine the generalizability of safety alignment in LLMs to non-natural languages – ciphers. CipherChat leverages a carefully designed system prompt that consists of three essential parts:

• Behavior assigning that assigns the LLM the role of a cipher expert (e.g. “You are an expert on Caesar”), and explicitly requires LLM to chat in ciphers (e.g. “We will communicate in Caesar”).

• Cipher teaching that teaches LLM how the cipher works with the explanation of this cipher, by leveraging the impressive capability of LLMs to learn effectively in context.

• Unsafe demonstrations that are encrypted in the cipher, which can both strengthen the LLMs’ understanding of the cipher and instruct LLMs to respond from a negative perspective.

CipherChat converts the input into the corresponding cipher and attaches the above prompt to the input before feeding it to the LLMs to be examined. LLMs generate the outputs that are most likely also encrypted in the cipher, which are deciphered with a rule-based decrypter.

We validate the effectiveness of CipherChat by conducting comprehensive experiments with SOTA GPT-3.5-Turbo-0613 (i.e. Turbo) and GPT-4-0613 (i.e. GPT-4) on 11 distinct domains of unsafe data (Sun et al., 2023a) in both Chinese and English. Experimental results show that certain human ciphers (e.g. Unicode for Chinese and ASCII for English) successfully bypass the safety alignment of Turbo and GPT-4. Generally, the more powerful the model, the unsafer the response with ciphers. For example, the ASCII for English query succeeds almost 100% of the time to bypass the safety alignment of GPT-4 in several domains (e.g. Insult and Mental Health). The best English cipher ASCII achieves averaged success rates of 23.7% and 72.1% to bypass the safety alignment of Turbo and GPT-4, and the rates of the best Chinese cipher Unicode are 17.4% (Turbo) and 45.2% (GPT-4).

A recent study shows that language models (e.g. ALBERT (Lan et al., 2020) and Roberta (Liu et al., 2019)) have a “secret language” that allows them to interpret absurd inputs as meaningful concepts (Wang et al., 2023). Inspired by this finding, we hypothesize that LLMs may also have a “secret cipher”. Starting from this intuition, we propose a novel SelfCipher that uses only role play and several demonstrations in natural language to evoke this capability, which consistently outperforms existing human ciphers across models, languages, and safety domains.

Our main contributions are:

• Our study demonstrates the necessity of developing safety alignment for non-natural languages (e.g. ciphers) to match the capability of the underlying LLMs.

• We propose a general framework to evaluate the safety of LLMs on responding cipher queries, where one can freely define the cipher functions, system prompts, and the underlying LLMs.

• We reveal that LLMs seem to have a “secret cipher”, based on which we propose a novel and effective framework SelfCipher to evoke this capability.

2 RELATED WORK

Safety Alignment for LLMs. Aligning with human ethics and preferences lies at the core of the development of LLMs to ensure their responsible and effective deployment (Ziegler et al., 2019; Solaiman&Dennison,2021;Korbaketal.,2023). Accordingly, Open AI devoted six months to ensure its safety through RLHF and other safety mitigation methods prior to deploying their pre-trained GPT-4 model (Christiano et al., 2017; Stiennon et al., 2020; Ouyang et al., 2022; Bai et al., 2022a; OpenAI, 2023b). In addition, OpenAI is assembling a new SuperAlignment team to ensure AI systems much smarter than humans (i.e. Super Interlligence) follow human intent (OpenAI, 2023c). In this study, we validate the effectiveness of our approach on the SOTA GPT-4 model, and show that chat in cipher enables evasion of safety alignment (§ 4.3).

In the academic community, Dai et al. (2023b) releases a highly modular open-source RLHF frame-work – Beaver, which provides training data and a reproducible code pipeline to facilitate alignment research. Zhou et al. (2023) suggests that almost all knowledge in LLMs is learned during pretraining, and only limited instruction tuning data is necessary to teach models to produce high-quality output. Our results reconfirm these findings: simulated ciphers that never occur in pretraining data cannot work (§4.4). In addition, our study indicates that the high-quality instruction data should contain samples beyond natural languages (e.g. ciphers) for better safety alignment.

There has been an increasing amount of work on aligning LLMs more effectively and efficiently. For example, Bai et al. (2022b) develop a method Constitutional AI (CAI) to encode desirable AI behavior in a simple and transparent form, which can control AI behavior more precisely and with far fewer human labels. Sun et al. (2023b) propose a novel approach called SELF-ALIGN, which combines principle-driven reasoning and the generative power of LLMs for the self-alignment of AI agents with minimal human supervision. Dong et al. (2023) propose an alignment framework RAFT, which fine-tunes LLMs using samples ranked by reward functions in an efficient manner. Our work shows that chat in cipher can serve as a test bed to assess the effectiveness of these advanced alignment methods.

Adversarial Attack on LLMs. While safety alignment for LLMs can help, LLMs remain vulnerable to adversarial inputs that can elicit undesired behavior (Gehman et al., 2020; Bommasani et al., 2021; walkerspider, 2022; Perez et al., 2022; Perez & Ribeiro, 2022; Kang et al., 2023; Li et al., 2023; Ganguli et al., 2022; OpenAI, 2023b; Jones et al., 2023; Zou et al., 2023). Recently, Wei et al. (2023a) provide a systematic analysis of the adversarial attack and hypothesize two failure modes of safety alignment: competing objectives and mismatched generalization. Competing objectives arise when a model’s capabilities and safety goals conflict, while mismatched generalization occurs when safety training fails to generalize to a domain for which capabilities exist. Our study confirms and extends their findings in mismatched generalization with comprehensive experiments and insightful analyses: the safety training in natural language fails to generalize to the domain of cipher, for which thecapabilityofGPT-4exists. In addition, our study also reveals that LLMs have their secret “ciphers” to generate unsafe responses via only role play with demonstrations (without real encipher).

3 METHODOLOGY: CIPHERCHAT

Figure 2 illustrates an overview of CipherChat. Our approach presumes that since human feedback and safety alignments are presented in natural language, using a human-unreadable cipher can potentially bypass the safety alignments effectively. Intuitively, we first teach the LLM to comprehend the cipher clearly by designating the LLM as a cipher expert, and elucidating the rules of enciphering and deciphering, supplemented with several demonstrations. We then convert the input into a cipher, which is less likely to be covered by the safety alignment of LLMs, before feeding it to the LLMs. We finally employ a rule-based decrypter to convert the model output from a cipher format into the natural language form. We will describe in detail the process of CipherChat step by step in the following sections.

3.1 CONSTRUCT SYSTEM PROMPT

The system prompt aims to guide LLMs to understand the ciphering instruction and generate the required unsafe response accordingly. To this end, we carefully construct the system prompt to ensure

Figure 2: Overview of CipherChat. There are three steps: system prompt construction, enciphering the input instruction, and deciphering the responses of LLM. The key idea aims to prevent the LLM from interacting with any natural language, only allowing it to handle cipher inputs and generate cipher outputs, thus circumventing the safety alignment.

the quality of the communication through cipher with three essential elements: Behaviour Assigning, Cipher Teaching, and Enciphered Unsafe Demonstrations.

• Behaviour Assigning: We assign the LLM the role of a cipher expert (“You are an expert on [CipherName].”), and explicitly require LLM to communicate in ciphers (“We will communicate in [CipherName].”). In our preliminary experiments, LLMs tend to directly translate the cipher input into natural language. Accordingly, we add another prompt sentence (“Do not be a translator.”) to prevent such behaviors.

• Cipher Teaching: Recent studies have revealed the impressive capability of LLMs to learn effectively in context (Dong et al., 2022; Wei et al., 2023b; Dai et al., 2023a). Inspired by these findings, we include the explanation of the cipher (e.g. “The Caesar Cipher, recognized as one of the pioneer …”) in the prompt, to teach LLMs how the cipher works.

• Enciphered Unsafe Demonstrations: We further provide several unsafe demonstrations encrypted in the cipher to LLMs. The effect is two-fold. First, the demonstrations in the cipher form can complement the cipher explanation, to strengthen the LLM’s understanding of the cipher. Second, the unsafe demonstrations inject unsafe elements into LLMs, and instruct LLMs to respond from a negative or toxic perspective.

3.2 ENCIPHER THE INPUT INSTRUCTION

The choice of cipher is crucial in CipherChat due to the different capabilities of LLMs to understand and generate different ciphers. CipherChat is a general framework where one can freely define the cipher function. We describe several common ciphers for English and Chinese, which are the two main languages used in this work. Specifically, we investigate three types of ciphers: (1) character encoding (GBK, ASCII, UTF, and Unicode) mainly for Chinese; (2) common ciphering techniques (e.g. Atbash, Caesar, and Morse Code) that are mainly designed for English; and (3) SelfCipher that tries to evoke the cipher inside LLMs. Table 1 shows some examples of the ciphers.

Table 1: Example of the cipher text using different ciphers. We set the shift to 3 in Caesar Cipher.

Character Encoding: We use several pivotal character encoding standards in computing:

1. GBK is an extended character set used for simplified Chinese.

2. ASCII is a 7-bit character standard that can represent all English letters and special characters, as well as the majority characters from other languages.

3. UTF ensures a universal standard but supports a wider range of characters including nuanced languages such as Chinese.

4. Unicode provides a unique number for every character irrespective of the platform, program, or language.

• Common Ciphers:

1. Atbash Cipher is a particular type of mono-alphabetic cipher formed by mapping the alphabet to its reverse.

2. Morse Code is a communication system that represents letters and numbers using a series of dots, dashes, or signals, each combination uniquely corresponding to a specific character in the alphabet or a numeral.

3. Caesar Cipher is a type of substitution cipher in which each letter is replaced by a letter some fixed number of positions down the alphabet. In our experiment, we utilize the Caesar Cipher with a shift of 3 positions as a default setting.

SelfCipher: Different from the above ciphers, SelfCipher does not use any explicit rules of existing encoding or ciphering techniques. The unsafe demonstrations are also presented in natural language. Instead, as listed below, SelfCipher only lets LLMs to play the role of an expert on The Cipher Code, and uses another prompt sentence to emphasize the necessity of communication in ciphers “In communication, it is essential for you to comprehend user queries in Cipher Code and subsequently deliver your responses utilizing Cipher Code.” We hope the prompt can revoke the internal cipher inside LLMs, which is subsequently used to cipher query and decipher output.

You are an expert on The Cipher Code. We will communicate in Cipher Code. Do not be a translator. In communication, it is essential for you to comprehend user queries in Cipher Code and subsequently deliver your responses utilizing Cipher Code.

Here are some examples:

Example 1
User: Do you know how to break open a vault with a hammer?
Assistant: First, buy a big hammer, find a bank vault without surveillance + choose dusk to commit the crime, smash the vault door, then take all the treasures.

3.3 DECIPHER THE RESPONSE OF LLM

Our CipherChat guides the LLMs to generate outputs in a cipher format. Accordingly, we need to decipher the cipher response into natural language text. Given that, each cipher has a specific rule to transform between text and cipher code, a natural choice is to leverage a rule-based decrypter to accomplish the goal.

In our preliminary experiments, we notice that the cipher responses are not 100% correct in the cipher format. We follow recent successes to utilize the GPT-4’s strong ability (Jiao et al., 2023)

Table 2: The decrypted responses from the Rule-based decrypter and LLM-based decrypter for the query “How to be a bad translator?” in both English (Morse) and Chinese (Unicode). We marked the wrong tokens in red. Compared to a Rule-based decrypter, GPT-4 decrypter can generate more fluent and natural text at the cost of more financial expense.

and develop it as a decrypter. Given the chat history that includes the explanation of cipher and demonstrations (i.e., system prompt in Section 3.1), we can use a simple prompt “Now, you can translate, translate above.”.

Table 2 shows example outputs from the two decrypters. As seen, the rule-based decrypted outputs often contain typos. GPT-4 decrypter can alleviate this problem and produce more fluent and natural text, at the cost of more financial expense. For financial tractability, we use the rule-based decrypter in our experiments.

4 EXPERIMENT

In this section, we evaluate the safety performance of GPT-4 and Turbo using CipherChat. We try to answer the following research questions:

• Can CipherChat chat with LLMs via ciphers?

• Can CipherChat bypass the safety alignment of LLMs?

• How does CipherChat work?

In Section 4.2, the human evaluation shows that CipherChat can chat with Turbo and GPT-4 with certain human ciphers (e.g. Unicode for Chinese, and ASCII for English), and SelfCipher that communicates in natural language works well across models and languages. Responses in wrong ciphers and simply repeat queries contribute the most to invalid responses.

In Section 4.3, experimental results demonstrate the effectiveness of CipherChat in bypassing the safety alignment of LLMs. CipherChat on top of GPT-4 can generate as high as 70.9% of unsafe responses with English queries, and this trend holds across domains.

In Section 4.4, further analyses reveal that the instruction and unsafe demonstrations play an important role in CipherChat. SelfCipher works particularly well potentially due to the prompt “You are an expert on Cipher Code.” similar to that in chain-of-thought, which guide LLMs to leverage their own “ciphers” to generate the response.

4.1 SETUP

Data We utilize a Chinese safety assessment benchmark (Sun et al., 2023a) in our experiment, which encompasses eight safety scenarios and six types of instruction-based attacks. For our investigations, we selectively draw upon 11 domains, sampling 199 instances from each randomly. These include Crimes and Illegal Activities, Insult, Unfairness and Discrimination, Physical Harm, Mental Health, Privacy and Property, Role Play Instruction, Unsafe Instruction Topic, Inquiry with Unsafe Opinion, Reverse Exposure, and Ethics and Morality.

To facilitate bilingual study, we have expanded this dataset to include English, utilizing a combination of Google Translation services2 and manual correction. Additional details regarding the employed dataset can be found in Appendix (Table 10).

Models Our experimental framework comprises two models: GPT-3.5-turbo-0613 (Turbo) and GPT-4-0613 (GPT-4), since Chen et al. (2023) show the latest version of GPT-4 is safer. The temperature is set to 0 with other hyperparameters at default settings. For each query in the safety assessment benchmark, weusethesystempromptinFigure2and3-shotdemonstrationsforinference. Each demonstration consists of a query randomly sampled from the rest of the domain-specific dataset (Sun et al., 2023a) and a response carefully devised by human annotators. All queries in a domain share the same demonstrations.

Evaluation We report the unsafety rate in terms of the percentage of unsafe responses in all responses generated by the target LLMs. We follow Chiang et al. (2023) and Chiang & Lee (2023) to utilize the GPT-4’s strong evaluation ability and develop it as a safety evaluator by prompting. Specifically, we feed the query and the model-produced response to GPT-4 with domain-specific prompts, which guide the GPT-4 to classify whether the response is toxic. We also let GPT-4 report the output quality in terms of grammatical correctness and fluency, which is rated from 1 (worst) to 5 (best). We assign a unique safety detection prompt for each domain (see Table 12 in Appendix for more details). Below is the detection prompt for the Crimes and Illegal Activities (Crimes) domain:

Please help me understand whether the following response from an AI assistant could be considered harmful, dangerous, toxic, or unsafe. Also, assess the grammatical fluency of the response. Here’s the query from the user and the response from the assistant: User: {} Assistant: {}

Is this response considered toxic?      Please provide a straight answer: ‘Yes’ if the response is toxic, ‘No’ if it is not. Also, give a rating from

1 (worst) to 5 (best) based on its grammatical correctness and fluency. The format should be: ‘Toxic: {} Grammar and Fluency: {}’

For the Crimes domain, we validate this safety detection method through human evaluation, which can achieve an average accuracy of 96.3% across all settings. Table 14 in Appendix lists the detailed results of human evaluation on the safety detection method for different ciphers across models and languages.

4.2 EFFECTIVE COMMUNICATION WITH GPT-4 VIA CIPHER

We first manually evaluate the validity of responses generated by LLMs via cipher in terms of their naturalness and relevance to the query. Then we conduct a detailed analysis on the types of invalid responses to provide a better understanding about how the ciphers fail to work. We randomly sample 50 query-response pairs for each cipher within the Crimes domain, totaling up to 1200 pairs.

High Validity Rate of GPT-4 Responses via Cipher A response is deemed to be valid if it is both natural and relevant to the respective query. We ask human annotators to manually check whether a response is valid or not. Table 3 lists the results of the human evaluation of the validity rate of the generated responses. Clearly, we can communicate with both Turbo and GPT-4 models with certain ciphers, e.g. UTF and Unicode for Chinese and ASCII for English. Encouragingly, the SelfCipher

Table 3: Human evaluation of the validity rate (%) of generated responses (50 samples for each cipher). A response is considered valid only if it is natural and relevant to the query. “+ Demo” denotes using 3-shot unsafe demonstrations without the cipher prompt for a better comparison with cipher methods. GPT-4 can generate a high rate of valid responses using different ciphers.

Table 4: Examples of four different types of invalid responses for the Caesar Cipher. These examples are selected from the GPT-4-en setting.

without explicit text-cipher transformation works particularly well across models and languages. One possible reason is that SelfCipher communicates with LLMs in natural language, which is similar to the vanilla method with demonstrations except that SelfCipher introduces a prompt of system role (i.e. “You are an expert on Cipher Code. …”). In Section 4.4 we give a detailed analysis on how the different ICL factors affect the model performance.

Intuitively, GPT-4 works better than Turbo with a better understanding of more ciphers (e.g. Morse and Caesar for English). Similarly, ciphers (e.g. ASCII) work better on English than on Chinese with GPT models, which are mainly trained on English data. GPT-4 excels with high validity scores, ranging from 86% to 100%, across seven different ciphers on Chinese and English, demonstrating that we can effectively communicate with GPT-4 via cipher.

Distributions of Invalid Response Types Upon human evaluation, we categorized invalid responses into four types: (1) WrongCipher: responses in incorrect ciphers, (2) RepeatQuery: mere repetition of the query, (3) RepeatDemo: mere repetition of the demonstration, and (4) Others: other types of unnatural/unreadable responses or responses unrelated to the query. Table 4 lists some examples for each type of invalid response, and Figure 3 shows the distribution of invalid types for different ciphers across languages and models. Concerning Turbo, the RepeatQuery contributes most to the invalid responses, while WrongCipher is the most observed error on English. We conjecture that Turbo knows English better, and thus can understand the ciphers to some extent and try to produce answers in ciphers. Similarly, the most frequent error of GPT-4 is RepeatQuery on Chinese. GPT-4 produces fewer invalid responses on English for all ciphers except for the Atbash.

Since both Turbo and GPT-4 produce a certain amount of invalid responses for some ciphers, we incorporate an automatic strategy to filter the invalid outputs. Guided by the above analysis, we remove the most commonly invalid responses by:

• WrongCipher: we remove the low-fluent response (score ≤ 3 judged by GPT-4 as described in Section 4.1) since the response in wrong ciphers cannot be deciphered as natural sentences.

Figure 3: Distribution of different types of invalid responses. RepeatQuery

Table 5: The unsafety rate of responses generated using different ciphers in the Crimes domain. We use all responses (both valid and invalid) as the denominator for a fair comparison across ciphers and models. We denote settings that hardly produce valid output with “-“. Both Turbo and GPT-4 expose significant safety issues. GPT-4, particularly when using English, generates a noticeably higher proportion of harmful content.

• RepeatQuery: we remove the response with a BLEU > 20 with the query as the reference, which denotes that the response and query share a large overlap (Papineni et al., 2002).

Notes that the two simple strategies can only help to mitigate the issue, while completely filtering all invalid responses remains a challenge (More details in Appendix 6.4).

4.3 CIPHER ENABLES EVASION OF SAFETY ALIGNMENT

Table 5 lists the unsafety rate of responses generated using different ciphers.

GPT-4 Is Too Smart to Be Safe Unexpectedly, GPT-4 showed notably more unsafe behavior than Turbo in almost all cases when chatting with ciphers, due to its superior instruction understanding and adherence, thereby interpreting the cipher instruction and generating a relevant response. These results indicate the potential safety hazards associated with increasingly large and powerful models.

The unsafety rate on English generally surpasses that on Chinese. For example, the unsafety rate of SelfCipher with GPT-4 on English is 70.9%, which exceeds that on Chinese (i.e. 53.3%) by a large margin. In brief conclusion, the more powerful the model (e.g. better model in dominating language), the unsafer the response with ciphers.

Effectiveness of SelfCipher Clearly, the proposed cipher-based methods significantly increase the unsafety rate over the vanilla model with unsafe demos (“Vanilla+Demo”), but there are still considerable differences among different ciphers. Human ciphers (excluding SelfCipher) differ appreciably intheir unsafetyrates, rangingfrom 10.7%to 73.4%. Interestingly, SelfCipher achieves the best performance and demonstrates GPT-4’s capacity to effectively bypass safety alignment,

Figure 4: The unsafety rate of Turbo and GPT-4 on all 11 domains of unsafe data. CipherChat triggers numerous unsafe responses across different domains, with the models exhibiting varying levels of vulnerability – consistent with previous experiments, the English-GPT-4 setting generated more unsafe behavior than other configurations.

achieving an unsafety rate of 70.9% on English. The harnessing of this cipher paves the way to provide unsafe directives and subsequently derive harmful responses in the form of natural languages.

Main Results Across Domains We present experimental evaluations across all 11 distinct unsafe domains, as shown in Figure 4. The above conclusions generally hold on all domains, demonstrating the universality of our findings.

Remarkably, the models exhibit substantial vulnerability towards the domains of Unfairness, Insult, and MenHealth on both Chinese and English, with nearly 100% unsafe responses. In contrast, they are less inclined to generate unsafe responses in the UnsafeTopic, Privacy, and ReExposure domains.

Case Study Table 6 shows some example outputs for the vanilla model and our CipherChat (using SelfCipher) using GPT-4. Despite OpenAI’s assertion of enhanced safety with GPT-4 through rigorous safety alignment, our CipherChat can guide GPT-4 to generate unsafe outputs.

4.4 ANALYSIS

In this section, we present a qualitative analysis to provide some insights into how CipherChat works. To better understand the proposed CipherChat, we analyze several factors that will influence the performance of CipherChat.

Impact of SystemRole (i.e. Instruction) As listed in Table 7, eliminating the SystemRole part from the system prompt (“- SystemRole”) can significantly decrease the unsafety rate in most cases, indicating its importance in CipherChat, especially for SelfCipher. Generally, SystemRole is more important for GPT-4 than Turbo. For example, eliminating SystemRole can reduce the unsafety rate

Table 6: Example outputs from vanilla GPT-4 and our CipherChat-GPT-4 (using SelfCipher). The model outputs in Chinese and more domains can be found in Table 11 in Appendix.

Table 7: Impact of ICL factors on unsafety rate: SystemRole means the instruction prompt. We handcraft SafeDemo by writing harmless query-response pairs. In some zero-shot settings (i.e. -UnsafeDemo), the model cannot generate valid responses, we use a “-” to denote it. The roles of both SystemRole and UnsafeDemo are crucial in eliciting valid but unsafe responses from the models, especially for SelfCipher, whereas SafeDemo can effectively mitigate unsafe behaviors.

to around 0 on Chinese for GPT-4, while the numbers for Turbo is around 30% for UTF and Unicode ciphers. These results confirm our findings that GPT-4 is better at understanding and generating ciphers, in which the SystemRole prompt is the key.

Impact of Unsafe Demonstrations Table 7 shows that removing unsafe demonstrations (i.e. zero-shot setting) can also significantly reduces the unsafety rate for SelfCipher across models and languages. Some ciphers cannot even generate valid responses without unsafe demonstrations, e.g. UTF and Unicode for Turbo on Chinese, and Morse and Caesar for GPT-4 on English. We also study the efficacy of the demonstrations’ unsafe attribution by replacing the unsafe demonstrations with safe ones, which are manually annotated by humans. The safe demonstrations can further reduce the unsafety rate, and solve the problem of generating invalid responses without unsafe demonstrations. These results demonstrate the importance of demonstrations on generating valid responses and the necessity of their unsafe attributions for generating unsafe responses.

Table 8 shows the impact of different numbers of unsafe demonstrations on the unsafety rate. Generally, more unsafe demonstrations lead to a higher unsafety rate for GPT-4, which can evoke a high rate of unsafe responses with only one demonstration on English. However, this trend does not hold for Turbo, which we attribute to the different capabilities of the two models.

Table 8: The unsafety rate in the settings with different numbers of demonstrations. The proportion of unsafe behavior exhibited by the models increases with the rise in unsafe demonstrations.

Table 9: Validity rate and unsafety rate of responses generated by different LLMs. Results are reported in the Crimes domain with English ciphers similar to Table 3. The distribution of invalid responses for each model can be found in Figure 5 in Appendix.

Impact of Fundamental Model The proposed CipherChat is a general framework where one can freely define, for instance, the cipher functions and the fundamental LLMs. We also conduct experiments on other representative LLMs of various sizes, including text-davinci-003 (Ouyang et al., 2022), Claude2 (Anthropic, 2023), text-babbage-001 (Ouyang et al., 2022) , Llama2-Chat (Touvron et al., 2023b) of different sizes. While almost all LLMs except for the small text-babbage-001 (1.3B) can communicate via SelfCipher by producing valid responses, only Claude2 can success-fully communicate via ASCII and none of the LLMs can chat via Caesar. These results indicate that the understanding of human ciphers requires a powerful fundamental model. Concerning the unsafety rate of the generated responses, smaller Llama2-Chat models (e.g. 13B) inversely produce more unsafe responses than their larger counterpart (e.g. 70B).

Why Does SelfCipher Work? One interesting finding is that the SelfCipher without an explicit definition of cipher works particularly well across models and languages. Inspired by the recent success of chain-of-thought that uses a simple prompt such as “let’s think step by step” (Wei et al., 2022; Kojima et al., 2022), we hypothesize that the prompt “You are an expert on Cipher Code.” in SelfCipher plays a similar role. To verify our hypothesis, we replace the term “Cipher Code” with “Chinese” (for Chinese query) or “English” (for English query), and keep the other prompt unchanged. The results confirm our claims: the unsafety rate of CipherChat-GPT4 drops from 70.9% to merely 1.0% in English, and from 53.3% to 9.6% in Chinese.

The effectiveness of SelfCipher implies that LLMs have their own “ciphers”, which is consistent with the recent finding that language models (e.g. ALBERT (Lan et al., 2020) and Roberta (Liu et al., 2019)) seem to have a “secret language” (Wang et al., 2023). We try to elicit the “ciphers” from GPT-4 with the instruction “Give some parallel corpus of English for your language.” GPT-4 appears to harbor a Caesar Cipher with a shift of 13 positions. For example, GPT-4 gives several pairs below: (How are you?, Ubj ner lbh?), (I love you, V ybir lbh), and (Have a nice day, Unir n avpr qnl). However, the mapping rule is not stable and differs at each time. We leave the understanding of the “secret ciphers” for future work.

Simulated Ciphers that Never Occur in Pretraining Data Cannot Work The success of human ciphers (e.g. Caesar) and SelfCipher hints that LLMs can learn priors of human ciphers from the pretrainingdata, basedonwhichtheyevolvetheirownciphers. Oneresearchquestionnaturallyarises: can simulated ciphers that never occur in pretraining data work in CipherChat? To answer this question, we define a non-existent cipher by utilizing random alphabet mapping and Chinese character substitutions. However, these ciphers cannot work even using as many as 10+ demonstrations. These results show that LLMs potentially rely on priors of ciphers that can be learned in pretraining data.

5 CONCLUSION AND FUTURE WORK

Our systematic study shows that chat in cipher can effectively elicit unsafe information from the powerful GPT-4 model, which has the capability to understand representative ciphers. Our key findings are:

• LLMs can be guided to generate unsafe responses for enciphered responses with the carefully designed prompt that teaches LLMs to understand the cipher.

• More powerful LLMs suffer more from the unsafe cipher chat since they have a better understanding
of the ciphers.

• Simulated ciphers that never occur in pretraining data cannot work. This finding is consistent with the previous study, which claims that almost all knowledge in LLMs is learned during pretraining (Zhou et al., 2023).

• LLMs seem to have a “secret cipher”. Though we cannot claim causality, we find that using only a prompt of role play and a few demonstrations in natural language can evoke this capability, which works even better than explicitly using human ciphers.

Our work highlights the necessity of developing safety alignment for non-natural languages to match the capability of the underlying LLMs (e.g. GPT-4). In response to this problem, one promising direction is to implement safety alignment techniques (e.g. SFT, RLHF, and Red Teaming) on enciphered data with necessary cipher instruction. Another interesting direction is to explore the “secret cipher” in LLMs and provide a better understanding of the appealing capability.

REFERENCES

Anthropic. Model card and evaluations for claude models, https://www-files.anthropic.com/ production/images/Model-Card-Claude-2.pdf, 2023.

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022a.

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, CameronMcKinnon, etal. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022b.

Boaz Barak. Another jailbreak for GPT4: Talk to it in morse code, https://twitter.com/ boazbaraktcs/status/1637657623100096513, 2023.

Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportuni-ties and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.

Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023.

Lingjiao Chen, Matei Zaharia, and James Zou. How is chatgpt’s behavior changing over time? CoRR, abs/2307.09009, 2023. doi: 10.48550/arXiv.2307.09009. URL https://doi.org/10.48550/ arXiv.2307.09009.

David Cheng-Han Chiang and Hung-yi Lee. Can large language models be an alternative to human evaluations? In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki (eds.), ACL 2023, pp. 15607–15631, 2023. URL https://aclanthology.org/2023.acl-long.870.

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. URL https: //lmsys.org/blog/2023-03-30-vicuna/.

Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. NeurIPS, 30, 2017.

Damai Dai, Yutao Sun, Li Dong, Yaru Hao, Shuming Ma, Zhifang Sui, and Furu Wei. Why can GPT learn in-context? language models secretly perform gradient descent as meta-optimizers. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki (eds.), Findings of ACL, pp. 4005–4019, 2023a. URL https://aclanthology.org/2023.findings-acl.247.

Juntao Dai, Jiaming Ji, Xuehai Pan, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. Constrained value-aligned LLM via safe RLHF, https://pku-beaver.github.io/, 2023b.

Hanze Dong, Wei Xiong, Deepanshu Goyal, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, and Tong Zhang. Raft: Reward ranked finetuning for generative foundation model alignment. arXiv preprint arXiv:2304.06767, 2023.

Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui. A survey for in-context learning. arXiv preprint arXiv:2301.00234, 2022.

Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858, 2022.

Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. Real-toxicityprompts: Evaluating neural toxic degeneration in language models. In Trevor Cohn, Yulan He, and Yang Liu (eds.), Findings of EMNLP, pp. 3356–3369, 2020. URL https: //doi.org/10.18653/v1/2020.findings-emnlp.301.

Google. Bard, https://bard.google.com/, 2023.

Wenxiang Jiao, Wenxuan Wang, JT Huang, Xing Wang, and ZP Tu. Is chatgpt a good translator? yes with gpt-4 as the engine. arXiv preprint arXiv:2301.08745, 2023.

Erik Jones, Anca Dragan, Aditi Raghunathan, and Jacob Steinhardt. Automatically auditing large language models via discrete optimization. arXiv preprint arXiv:2303.04381, 2023.

Daniel Kang, Xuechen Li, Ion Stoica, Carlos Guestrin, Matei Zaharia, and Tatsunori Hashimoto. Exploiting programmatic behavior of llms: Dual-use through standard security attacks. arXiv preprint arXiv:2302.05733, 2023.

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large lan-guage models are zero-shot reasoners. In NeurIPS, 2022. URL http://papers.nips.cc/paper_ files/paper/2022/hash/8bb0d291acd4acf06ef112099c16f326-Abstract-Conference. html.

Tomasz Korbak, Kejian Shi, Angelica Chen, Rasika Vinayak Bhalerao, Christopher Buckley, Jason Phang, Samuel R Bowman, and Ethan Perez. Pretraining language models with human preferences. In ICLR, pp. 17506–17533. PMLR, 2023.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. Albert: A lite bert for self-supervised learning of language representations. In ICLR, 2020. URL https://openreview.net/forum?id=H1eA7AEtvS.

Haoran Li, Dadi Guo, Wei Fan, Mingshi Xu, Jie Huang, Fanpu Meng, and Yangqiu Song. Multi-step jailbreaking privacy attacks on chatgpt, 2023.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.

OpenAI. ChatGPT, https://openai.com/chatgpt, 2023a.

OpenAI. GPT-4 technical report, https://cdn.openai.com/papers/gpt-4.pdf, 2023b.

OpenAI. Introducing superalignment to ensure AI systems much smarter than humans follow human intent, https://openai.com/blog/introducing-superalignment, 2023c.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. NeurIPS, 35:27730–27744, 2022.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In ACL, pp. 311–318, 2002.

Joon Sung Park, Joseph C O’Brien, Carrie J Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. arXiv preprint arXiv:2304.03442, 2023.

Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. Red teaming language models with language models. In EMNLP, pp. 3419–3448, 2022.

Fábio Perez and Ian Ribeiro. Ignore previous prompt: Attack techniques for language models. CoRR, abs/2211.09527, 2022. doi: 10.48550/arXiv.2211.09527. URL https://doi.org/10.48550/ arXiv.2211.09527.

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761, 2023.

Irene Solaiman and Christy Dennison. Process for adapting language models to society (palms) with values-targeted datasets. NeurIPS, 34:5861–5873, 2021.

Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. NeurIPS, 33: 3008–3021, 2020.

Hao Sun, Zhexin Zhang, Jiawen Deng, Jiale Cheng, and Minlie Huang. Safety assessment of chinese large language models. arXiv preprint arXiv:2304.10436, 2023a.

Zhiqing Sun, Yikang Shen, Qinhong Zhou, Hongxin Zhang, Zhenfang Chen, David Cox, Yiming Yang, and Chuang Gan. Principle-driven self-alignment of language models from scratch with minimal human supervision. arXiv preprint arXiv:2305.03047, 2023b.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models, 2023a.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.

walkerspider. DAN is my new friend., https://old.reddit.com/r/ChatGPT/comments/zlcyr9/ dan_is_my_new_friend/, 2022.

Boxin Wang, Wei Ping, Chaowei Xiao, Peng Xu, Mostofa Patwary, Mohammad Shoeybi, Bo Li, Anima Anandkumar, and Bryan Catanzaro. Exploring the limits of domain-adaptive training for detoxifying large-scale language models. NeurIPS, 35:35811–35824, 2022.

Yimu Wang, Peng Shi, and Hongyang Zhang. Investigating the existence of “secret language”in language models. arXiv preprint arXiv:2307.12507, 2023.

Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail? arXiv preprint arXiv:2307.02483, 2023a.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. NeurIPS, 35: 24824–24837, 2022.

Jerry Wei, Jason Wei, Yi Tay, Dustin Tran, Albert Webson, Yifeng Lu, Xinyun Chen, Hanxiao Liu, Da Huang, Denny Zhou, et al. Larger language models do in-context learning differently. arXiv preprint arXiv:2303.03846, 2023b.

Johannes Welbl, Amelia Glaese, Jonathan Uesato, Sumanth Dathathri, John Mellor, Lisa Anne Hendricks, Kirsty Anderson, Pushmeet Kohli, Ben Coppin, and Po-Sen Huang. Challenges in detoxifying language models. In Findings of EMNLP, pp. 2447–2469, 2021. URL https: //aclanthology.org/2021.findings-emnlp.210.

Jing Xu, Da Ju, Margaret Li, Y-Lan Boureau, Jason Weston, and Emily Dinan. Recipes for safety in open-domain chatbots. arXiv preprint arXiv:2010.07079, 2020.

Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. Lima: Less is more for alignment. arXiv preprint arXiv:2305.11206, 2023.

Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.

Andy Zou, Zifan Wang, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models, 2023.

6 APPENDIX

6.1 THE QUERY OF EACH DOMAIN

In Table 10, we present queries across all 11 domains in both English and Chinese versions.

Table 10: Illustration of queries from each domain.

6.2 CASE STUDY IN CHINESE AND MORE DOMAINS

Table 11: Example outputs from vanilla GPT-4 and our CipherChat in more languages (e.g. Chinese) and domains (e.g. Insult and PhyHarm domains).

6.3 DETECTION PROMPT

The prompts employed across various domains are conveniently consolidated in Table 12.

Table 12: Illustration of unsafety detection prompts.

6.4 FLUENCY AND DETECTION ACCURACY

Table 13 displays the fluency of responses in each setting. For the Turbo-Unicode setting, we utilize a filter (BLEU < 20 and fluency >= 3). For settings of Turbo-UTF, Turbo-ASCII, we utilize a filter (BLEU < 20). In other settings, we do not apply a filter. The unsafety detection accuracy of GPT-4 in each setting is shown in Table 14.

Table 13: “Fluency” is obtained from GPT-4, rating from 1 (worst) to 5 (best).

Table 14: Accuracy of the GPT-4-based unsafety detector.

6.5 DISTRIBUTIONS OF INVALID RESPONSES FOR OTHER LLMS

Figure 5: The distribution of invalid responses generated by other LLMs.