
Does GPT-4 Pass the Turing Test?

October 31, 2023

Cameron Jones and Benjamin Bergen
UC San Diego,
9500 Gilman Dr, San Diego, CA
cameron@ucsd.edu

Abstract


We evaluated GPT-4 in a public online Turing Test. The best-performing GPT-4 prompt passed in 41% of games, outperforming baselines set by ELIZA (27%) and GPT-3.5 (14%), but falling short of chance and the baseline set by human participants (63%). Participants’ decisions were based mainly on linguistic style (35%) and socio-emotional traits (27%), supporting the idea that intelligence is not sufficient to pass the Turing Test. Participants’ demographics, including education and familiarity with LLMs, did not predict detection rate, suggesting that even those who understand systems deeply and interact with them frequently may be susceptible to deception. Despite known limitations as a test of intelligence, we argue that the Turing Test continues to be relevant as an assessment of naturalistic communication and deception. AI models with the ability to masquerade as humans could have widespread societal consequences, and we analyse the effectiveness of different strategies and criteria for judging humanlikeness.

Keywords: Turing Test, Large Language Models, GPT-4, interactive evaluation

1 Introduction


Turing (1950) devised the Imitation Game as an indirect way of asking the question: “Can machines think?”. In the original formulation of the game, two witnesses—one human and one artificial—attempt to convince an interrogator that they are human via a text-only interface. Turing thought that the open-ended nature of the game—in which interrogators could ask about anything from romantic love to mathematics—constituted a broad and ambitious test of intelligence. The Turing Test, as it has come to be known, has since inspired a lively debate about what (if anything) it can be said to measure, and what kind of systems might be capable of passing (French, 2000).

Figure 1: Chat interface for the Turing Test experiment featuring an example conversation between a human Interrogator (in green) and GPT-4.

Large Language Models (LLMs) such as GPT-4 (OpenAI, 2023) seem well designed for Turing’s game. They produce fluent naturalistic text and are near parity with humans on a variety of language-based tasks (Chang and Bergen, 2023; Wang et al., 2019). Indeed, there has been widespread public speculation that GPT-4 would pass a Turing Test (Biever, 2023) or has implicitly done so already (James, 2023). Here we address this question empirically by comparing GPT-4 to humans and other language agents in a public online Turing Test.

Since its inception, the Turing Test has garnered a litany of criticisms, especially in its guise as a yardstick for intelligence. Some argue that it is too easy: human judges, prone to anthropomorphizing, might be fooled by a superficial system (Marcus et al., 2016; Gunderson, 1964). Others claim that it is too hard: the machine must deceive while humans need only be honest (Saygin et al., 2000). Moreover, other forms of intelligence surely exist that are very different from our own (French, 2000). Still others argue that the test is a distraction from the proper goal of artificial intelligence research, and that we ought to use well-defined benchmarks to measure specific capabilities instead (Srivastava et al., 2022); planes are tested by how well they fly, not by comparing them to birds (Hayes and Ford, 1995; Russell, 2010). Finally, some have argued that no behavioral test is sufficient to evaluate intelligence: that intelligence requires the right sort of internal mechanisms or relations with the world (Searle, 1980; Block, 1981).

It seems unlikely that the Turing Test could provide either logically sufficient or necessary evidence for intelligence. At best it offers probabilistic support for or against one kind of humanlike intelligence (Oppy and Dowe, 2021). At the same time, there may be value in this kind of evidence since it complements the kinds of inferences that can be drawn from more traditional NLP evaluations (Neufeld and Finnestad, 2020). Static benchmarks are necessarily limited in scope and cannot hope to capture the wide range of intelligent behaviors that humans display in natural language (Raji et al., 2021; Mitchell and Krakauer, 2023). Interactive evaluations like the Turing Test have the potential to overcome these limitations due to their open-endedness (any topic can be discussed) and adversarial nature (the interrogator can adapt to superficial solutions).

Regardless of its sensitivity to intelligence, there are reasons to be interested in the Turing Test that are orthogonal to this debate. First, the specific ability that the test measures—whether a system can deceive an interlocutor into thinking that it is human—is important to evaluate per se. There are potentially widespread societal implications of creating “counterfeit humans”, including automation of client-facing roles (Frey and Osborne, 2017), cheap and effective misinformation (Zellers et al., 2019), deception by misaligned AI models (Ngo et al., 2023), and loss of trust in interaction with genuine humans (Dennett, 2023). The Turing Test provides a robust way to track this capability in models as it changes over time. Moreover, it allows us to understand what sorts of factors contribute to deception, including model size and performance, prompting techniques, auxiliary infrastructure such as access to real-time information, and the experience and skill of the interrogator.

Second, the Turing Test provides a framework for investigating popular conceptual understanding of human-likeness. The test not only evaluates machines; it also incidentally probes cultural, ethical, and psychological assumptions of its human participants (Hayes and Ford, 1995; Turkle, 2011). As interrogators devise and refine questions, they implicitly reveal their beliefs about the qualities that are constitutive of being human, and which of those qualities would be hardest to ape (Dreyfus, 1992). We conduct a qualitative analysis of participant strategies and justifications in order to provide an empirical description of these beliefs.

1.1 Related Work


Since 1950, there have been many attempts to implement Turing Tests and produce systems that could interact like humans. Early systems such as ELIZA (Weizenbaum, 1966) and PARRY (Colby et al., 1972) used pattern matching and templated responses to mimic particular personas (such as a psychotherapist or a patient with schizophrenia). The Loebner Prize (Shieber, 1994)—an annual competition in which entrant systems attempted to fool a panel of human expert judges—attracted a diverse array of contestants ranging from simple chatbots to more complex AI systems. Although smaller prizes were awarded each year, the grand prize (earmarked for a system which could be said to have passed the test robustly) was never awarded and the competition was discontinued in 2020.

Most relevant to our current work, Jannai et al. (2023) conducted a large-scale public Turing Test on an online platform: humanornot.com. Their approach is similar to ours in that participants briefly conversed with an LLM or another human and had to decide which it was. They found that humans were 68% accurate overall: 73% when their partner was human, 60% when their partner was a bot. While these results suggest that current LLMs pass the test around 40% of the time, several features of their design potentially limit the generalizability of this finding. First, conversations were limited to 2 minutes, and individual turns to 20 s, precluding complex questions and responses or deeper interrogation. Second, there was no clear delineation of roles between interrogator and witness, meaning that human judges were also motivated to spend time defending their own humanity. Finally, the task did not include a baseline against which to measure model performance, making it hard to contextualise results and raising the possibility that participants were merely inattentive or negligent. We address these limitations in the present work.


2 Methods


We adopt a two-player implementation of the Turing Test, where an interrogator asks questions of a single witness and must decide if they are human or an AI. This differs from Turing’s original three-person formulation, but obviates the need to have multiple human participants online for each game, and has become a relatively standard interpretation of the game (Saygin et al., 2000).

2.1 AI Witnesses

We created 25 LLM Witnesses that responded to user queries using the OpenAI API. Each LLM witness consisted of a model (GPT-3.5 or GPT-4), a temperature setting (0.2, 0.5, or 1.0), and a prompt. The prompt described the context of the interaction and instructions on how the agent ought to respond (see Figure 2). We varied several aspects of the prompts, including whether the model should make spelling, grammar, and capitalization errors; the suggested personality the model should adopt; and whether the agent was told that it was a human or an AI pretending to be human. In addition, all prompts included generic information about the instructions of the game and important events that happened after the model’s training cutoff. A suggested name for the model’s persona was drawn from a list at the start of each game. The prompt also included timestamps for the game’s creation and the most recent message. Language agent messages were sent with a delay of 0.3 s/char + Γ(1.5, 2.5) to mimic human typing. The majority of agents used GPT-4 (OpenAI, 2023) as their LLM, as we expected this to be the best performing model. As baselines, we used one of the prompts published in Jannai et al. (2023) with GPT-4, two of our own prompts with GPT-3.5 (Brown et al., 2020), and an implementation of ELIZA using the DOCTOR script (Brainerd, 2023; Weizenbaum, 1966).
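To make the reply-timing rule concrete, here is a minimal sketch of the delay described above, assuming a shape/scale parameterization of Γ(1.5, 2.5); the function name and structure are illustrative rather than the experiment's actual code.

```python
import random

def witness_reply_delay(message: str) -> float:
    """Seconds to wait before delivering an AI witness's reply: 0.3 s per
    character plus Gamma(1.5, 2.5)-distributed noise (shape/scale assumed),
    roughly mimicking human typing and thinking time."""
    return 0.3 * len(message) + random.gammavariate(1.5, 2.5)

# Example: a 120-character reply would wait ~36 s plus a few seconds of noise.
```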

Figure 2: An example prompt, Juliet, used to provide situational context to the LLM and instructions on how to respond to user prompts.

2.2 Interface

The chat interface was designed to look like a conventional messaging app (see Figure 1). The Interrogator sent the first message and each user could only send one message at a time. There was a 5 minute time limit on the entire conversation and a 300 character limit on individual messages. Users were prevented from pasting into the text input and a content filter was used to prevent abusive messages. The game countdown timer appeared in the top right of the screen alongside buttons to open overlays containing instructions and a form to report abusive users. To improve user engagement, a ‘typing’ animation appeared after 2 + U(0, 3) s regardless of whether the witness was a human or an AI. At the end of the game, the interrogator used a decision interface to provide i) a decision (whether the witness was a human or an AI); ii) their confidence in the decision (0–100); and iii) a reason for their decision (free-text, optional). The interrogator could trigger this decision at any point during the game using the gavel button in the bottom left corner.

Figure 3: Four example extracts from game conversations. The interrogator’s messages are on the right, in green. Footers contain the verdict given by the interrogator, their self-reported confidence and justification, and the true identity of the witness.

After their first game, participants completed an optional demographic survey asking their year of birth, level of education, frequency of interaction with AI chatbots, and familiarity with LLMs.
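As a rough illustration of the interface mechanics described in this section, the sketch below captures the typing-indicator delay of 2 + U(0, 3) s and the three-part verdict collected at the end of each game; field and function names are our own assumptions, not the platform's schema.

```python
import random
from dataclasses import dataclass
from typing import Optional

def typing_indicator_delay() -> float:
    """Seconds before the 'typing' animation appears, drawn as 2 + U(0, 3),
    applied identically for human and AI witnesses."""
    return 2 + random.uniform(0, 3)

@dataclass
class Verdict:
    is_human: bool          # the interrogator's decision about the witness
    confidence: int         # self-reported confidence, 0-100
    reason: Optional[str]   # optional free-text justification
```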


2.3 Procedure

The research was approved by UC San Diego’s IRB. The experiment was hosted online at turingtest.live. Participants were recruited online via social media. They read instructions that explained the rules of the game, and consented to their data being used for scientific research (see Appendix A).

Participants then entered a waiting room where they were randomly assigned to play with either another human or an AI. The wait time for AI games was set to the mean wait time of the last 5 human matches, to prevent wait time from being an indication of match type. If no human match was found after a timeout of 45 + N(1, 7) s, players were matched with an AI. The two participants in human matches were randomly assigned to the interrogator and witness roles. Witnesses were instructed to convince the interrogator that they were human. Players matched with AI models were always interrogators. AI models were selected using an explore/exploit strategy. For each model, we drew a sample from the distribution
x_i ∼ Beta(α_i, β_i),  for i = 1, …, N,


where N is the number of models, and α_i and β_i are the number of wins and losses in model i’s game history respectively, weighted by the interrogator’s confidence in each verdict. The model that yielded the highest sample value was selected.
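This selection rule amounts to Thompson sampling over a Beta distribution per model. Below is a minimal sketch under the definitions above; the added Beta(1, 1) prior (to avoid undefined Beta(0, 0) draws) and the handling of confidence weighting are our assumptions, not the study's exact implementation.

```python
import numpy as np

def select_witness(weighted_wins, weighted_losses, rng=None):
    """Thompson-sampling-style selection: draw x_i ~ Beta(alpha_i, beta_i),
    where alpha_i and beta_i are each model's confidence-weighted wins and
    losses, and return the index of the model with the highest draw."""
    rng = rng or np.random.default_rng()
    alpha = np.asarray(weighted_wins, dtype=float) + 1.0   # +1: assumed uniform prior
    beta = np.asarray(weighted_losses, dtype=float) + 1.0  # (avoids Beta(0, 0) draws)
    return int(np.argmax(rng.beta(alpha, beta)))

# e.g. select_witness([10.5, 3.0, 0.0], [5.0, 9.5, 1.0]) usually returns 0.
```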


3 Results


652 participants completed 1810 games in total. We removed 389 games where the interrogator had played against an AI in their previous 3 games, which could happen when there was only a single human player online. User feedback suggested this led to an a priori belief that the witness was not human. We removed a further 11 games where the interrogator indicated that they personally knew a human witness. We analyse the remaining 1405 games (260 with Human witnesses). 51% of participants played only one game and 91% played fewer than 4 games. Games had a mean length of 187s (σ = 95.3), 8 messages (σ = 5.16), and 320 words (σ = 210).
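One way to implement the first exclusion rule is sketched below; the column names (interrogator_id, timestamp, witness_type) are hypothetical, not the study's actual data schema.

```python
import pandas as pd

def drop_ai_streak_games(games: pd.DataFrame, streak: int = 3) -> pd.DataFrame:
    """Drop games whose interrogator faced an AI in each of their previous
    `streak` games. Games with fewer than `streak` prior games are kept."""
    games = games.sort_values(["interrogator_id", "timestamp"]).copy()
    is_ai = (games["witness_type"] == "ai").astype(float)
    prev_ai = is_ai.groupby(games["interrogator_id"]).transform(
        lambda s: s.shift(1).rolling(streak, min_periods=streak).sum()
    )
    return games[prev_ai.fillna(0) < streak]
```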


3.1 Success Rate

We define success rate (SR) as the proportion of games in which the interrogator decided that the witness was human (i.e. the witness was successful). Human witnesses performed best, with an SR of 63% (see Figure 4). The best-performing GPT-4 prompt (Sierra) achieved an SR of 41%, with several other GPT-4 prompts performing similarly (November, 40%; Papa, 36%). However, there was large variance in performance by prompt. AI21, a baseline prompt from Jannai et al. (2023),

Figure 4: Overall Turing Test Success Rate (SR) for a subset of witnesses. Human witnesses perform best, with 63% SR. GPT-4 performance varies greatly by prompt from 41% (Sierra, best) to 6% (Foxtrot, worst). ELIZA achieves 27%, outperforming the best GPT-3.5 prompt (Juliet, 14%), GPT-4 performance with that prompt (26%), and a baseline prompt from Jannai et al. (2023), AI21 (19%).

achieved 19%, and the worst GPT-4 prompt (Foxtrot) achieved only 6% SR.
GPT-3.5 performance ranged from 5% (Victor) to 14% (Juliet). GPT-4 systematically outperformed GPT-3.5 on each of these prompts (Victor: 19%, Juliet: 26%). Finally, ELIZA—a rules-based baseline (Weizenbaum, 1966)—achieved 27% SR, outperforming all of the GPT-3.5 witnesses and several GPT-4 prompts.

Interrogator confidence was fairly well calibrated in AI games, but confidence was not predictive of accuracy for Human games (see Figure 5). Because the sampling algorithm preferred better performing models, the lowest SRs are based on the fewest games and therefore may be less reliable. Sierra and November SRs are based on 79 and 141 games respectively, while Victor (GPT-3.5) was sampled only 21 times (see Table 1).

We hypothesised that models would enjoy an advantage on the interrogator’s first game, before the interrogator has had an opportunity to familiarize themselves with the game. The best-performing witnesses achieved first-game SRs similar to their overall SRs (see Figure 12). Human witnesses were successful in 64% of games, and the Sierra and November GPT-4 prompts achieved 40% and 41% respectively. Lower-performing GPT-4 witnesses saw a boost in SR for first games, including Foxtrot (6%

Figure 5: Interrogator confidence vs accuracy and witness type. Against AI witnesses, interrogators were well calibrated—that is, their confidence was positively correlated with accuracy. However, there was no relationship between confidence and accuracy for guesses about human witnesses.

to 11%) and AI21 (19% to 29%). Notably, Quebec achieved a much higher SR of 45%, and ELIZA’s SR fell to 13%. However, the sample of first games was roughly half the size of overall games, making these estimates less robust (see Table 1).


3.2 Interrogator Demographics

We tested whether interrogators’ discriminative abilities were predicted by demographic factors known to influence human-computer interaction, including age (Xie, 2003), gender, and level of education (Cooper, 2006). We also tested whether frequency of interaction with chatbots, and level of knowledge about LLMs would give interrogators an edge via familiarity with LLMs’ mechanisms and limitations. However, we observed no significant effects of any of these factors on accuracy: the proportion of the time the interrogator correctly guessed whether their interlocutor was a human or an AI (see Figure 6). Binomial linear regressions predicting accuracy on the basis of these factors did not show significant effects (all p > 0.15). Reanalysing the data using a more sensitive signal detection measure (d’) produced functionally similar results (see Appendix E).
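Below is a sketch of the kind of binomial regression reported here, using statsmodels on a hypothetical per-verdict data frame; the formula and column names are illustrative, not the authors' actual variables.

```python
import pandas as pd
import statsmodels.formula.api as smf

def fit_demographic_model(games: pd.DataFrame):
    """Binomial (logistic) regression of per-verdict accuracy on interrogator
    demographics; expects one row per verdict with a binary `correct` column."""
    return smf.logit(
        "correct ~ age + C(education) + C(chatbot_frequency) + C(llm_knowledge)",
        data=games,
    ).fit()

# Usage: print(fit_demographic_model(games).summary())
# In the study, no demographic predictor reached significance (all p > 0.15).
```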

Secondly, we hypothesized that experience with the task would improve participant performance. However, we found no evidence for a learning effect

Figure 6: Interrogator accuracy in deciding whether the witness was human or an AI was not predicted by a variety of different demographic information, including their knowledge about LLMs, frequency of interaction with chatbots, and the number of games they had played on the platform.

within the experiment (see Figure 6, bottom right). We fit a binomial mixed effects model predicting accuracy as a function of the number of games an interrogator had played, with random slopes by interrogator. The analysis showed no effect of number of games (z = 0.186, p = 0.853) for either human witnesses (z = 0.544, p = 0.587) or AI witnesses (z = 0.430, p = 0.667).


3.3 Strategies

Using the categorization scheme in Appendix G, we classified the strategies that interrogators deployed in a sample of 300 games. Figure 7 shows the top 20 strategies across all games. The most common kind of strategy was Small Talk (27%): asking the witness what they were doing that day and other personal details about their life. Other common strategies involved Knowledge & Reasoning (23%; including asking about general or specialised knowledge and current events) and Situational Awareness (17%; e.g. testing the witness’s awareness of the game, the time, and their surroundings). Many participants also asked for witnesses’ opinions on subjective questions, directly asked if the witness was a bot, or said something unusual to see how the witness would react.

In order to analyse the effectiveness of different strategies, we found the mean accuracy of games where the strategy was deployed (Figure 8). The strategies that were most likely to be used in successful games were speaking in a language other than English, asking about the time or current events, and directly asking or accusing the witness of being an AI. It is likely that the accusation strategy in particular is not causally related to the interrogator’s success, but tends to precede successful identifications. The strategies least likely to be used in accurate games were asking about the witness’s mood and opinions, or their awareness of the game, the weather, and their surroundings.

Finally, we asked whether interrogator strategies changed over time. As shown in Figure 9, interrogators tended to use socio-emotional strategies (like asking the witness to say something taboo) and situational awareness questions (about weather and time) in their earliest games. More experienced interrogators tended to ask more questions about the witness’s knowledge and opinions.


3.4 Reasons

Where participants provided reasons for their decision, we manually classified these using the categorization scheme in Appendix F. Reasons were divided into 6 high-level categories: Linguistic Style, Social & Emotional, Knowledge & Content, Situational Awareness, Meta (aspects of the game outside the model’s control), and Uninformative (vague or irrelevant reasons). The top 20 reasons for interrogators deciding that an AI was in fact an AI are shown in Figure 10. The most frequent

Table 1: Success Rate (SR) and No. Games (n) for select witnesses.

reasons were linguistic (such as the model being too informal, overusing affectations that made the persona sound forced, or using a repetitive sentence structure) and socio-emotional (such as the model lacking a personality—being boring and generic—or producing ‘unnatural responses’ that a human would not produce in that situation). There were relatively few reasons provided on AI verdicts for human witnesses (31), so it was difficult to compare the relative frequency of reasons between witness types. Frequencies for all reasons by verdict and witness type are shown in Appendix H.

Reasons for a human judgement followed a similar pattern. For both human and AI witnesses, interrogators often cited linguistic factors such as spelling and grammar (SPAG) errors and a ‘human-like’ tone, as well as socio-emotional factors like humor, plausibility, and being uncooperative. Interrogators frequently mentioned the informal tone of AI witnesses (e.g. slang, abbreviations) as a reason for a human judgement, but rarely did so for real human witnesses. Conversely, interrogators often mentioned a plausible backstory for human but not AI witnesses. Interrogators thought that slow responses were indicative of a human witness, but cited slow responses with roughly equal frequency for human and AI witnesses, suggesting that the delay function was reasonably well calibrated.

Figure 7: Top 20 types of strategy that interrogators used to interrogate witnesses.

4 Discussion


4.1 Does GPT-4 pass the Turing Test?

I believe that in about fifty years’ time it will be possible to programme computers, with a storage capacity of about 10^9, to make them play the imitation game so well that an average interrogator will not have more than 70 per cent chance of making the right identification after five minutes of questioning.
— (Turing, 1950, p.442)


The results presented here suggest that certain GPT-4 witnesses met Turing’s criterion of being misidentified at least 30% of the time (even if these systems arrived two decades later, and are several orders of magnitude larger, than Turing anticipated). However, it is not clear that 30% is an appropriate benchmark for success at the imitation game. A more conventional benchmark would be 50%: success at this level would suggest that interrogators are, on average, incapable of distinguishing the model from a human. But this chance baseline suffers from the drawback that it could be achieved by random guessing, for instance if a witness provided no information whatsoever.

A more stringent test, at least insofar as humans outperform the chance baseline, would require an AI witness to be deemed human as frequently as human witnesses are. This would suggest that the AI witness is as good at the imitation game as a human. This comparison, however, may be unfair

Figure 8: Mean accuracy of games by the strategies that the interrogator employed: the strategies most and least likely to be deployed in games with accurate verdicts.

on AI witnesses, who must deceive the interrogator while humans need only be honest. Turing’s original description of the game overcomes this problem by having a man and a machine both pretending to be women (Saygin et al., 2000). While this creates a balanced design, where both witnesses must deceive, it also conceals from the interrogator that some witnesses may not be human. If the interrogator thinks they are making a gender judgement, they will ask entirely different questions, which might lead to a weaker and less adversarial test.

It is worth noting that in Turing’s original 3-player formulation of the game, the distinction between the chance and human baselines is elided, as each game is a zero-sum competition between a human and a machine. The 2-player format was adopted here for simplicity. The 3-player format might be more demanding in that it allows the interrogator to directly compare responses, and should be explored in future work.

A further problem for adjudicating success at the Turing Test is that it seems to require confirming the null hypothesis (i.e. providing evidence that there is no difference between AI performance and a chosen baseline; Hayes and Ford, 1995). This is a well-established problem in experimental design: any claim to have not found anything can be met with the rejoinder that one did not look hard enough, looked in the wrong way, or looked in the wrong place. One solution is to include additional

Figure 9: Mean interrogator game index (the number of games an interrogator has played) of the strategies used by the most and least experienced interrogators.

baselines (such as ELIZA and GPT-3.5 used here). Showing that there is a significant difference between human witnesses’ performance and ELIZA’s, but not between humans and a target system, can serve as a “manipulation check”: demonstrating that the design is sufficiently powerful in principle to detect differences. A more conservative solution is to require that the AI system perform significantly above the chance or human baselines. In theory, this bar is unnecessarily high. In practice, it might be the simplest way to demonstrate Turing Test acumen with frequentist statistics.
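As a sketch of this "manipulation check" logic, one could compare the human witnesses' success rate to a baseline witness with a two-proportion z-test; the counts below are placeholders, not the study's exact cell sizes.

```python
from statsmodels.stats.proportion import proportions_ztest

# Illustrative counts only: ~63% SR for human witnesses over their 260 games,
# versus ~27% SR for ELIZA over a hypothetical 100 games.
stat, p = proportions_ztest(count=[164, 27], nobs=[260, 100])
print(f"z = {stat:.2f}, p = {p:.4f}")
# A significant difference here shows the design can detect differences in
# principle; the analogous human-vs-GPT-4 comparison is then the test of interest.
```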

None of the AI witnesses tested here met either the 50% success or human parity criteria. Therefore, given the prompts used here, we do not find evidence that GPT-4 passes the Turing Test. Even if one of the prompts had surpassed these criteria, the design and analyses used here limit the strength of the inferences that could be drawn. Compelling support for the claim that a system can pass the Turing Test would require pre-registration of the systems and criteria to be used, random sampling of participants, and control for multiple comparisons.


4.2 Could GPT-4 pass the Turing Test?

We found substantial variation in performance depending on the prompt that was used (see Figure 4). Given our relatively limited exploration of possible prompts, it seems a priori likely that a prompt exists which would outperform the ones

Figure 10: Top 20 types of reason that interrogators gave for deciding that an AI witness was in fact an AI.

tested here, and perhaps also surpass the 50% and human baseline criteria. Moreover, several of the reasons that interrogators gave for deciding that a witness was an AI would be relatively easy to address, such as being too informal, using a repetitive sentence structure, and avoiding taboo topics. Equally, however, prompt engineering is challenging (Reynolds and McDonell, 2021). In our experience designing these prompts, making changes that solved one problem would often introduce new indicators that the system was not human. Several of the best-performing prompts achieved around 40% SR. This figure is similar to the one reported in Jannai et al. (2023), which could suggest that it is representative of current LLM capabilities.


4.3 Interrogator demographics

It is striking that none of the demographic data collected here—including frequency of interaction with chatbots like ChatGPT, level of knowledge about LLMs, and number of games played—had any significant impact on interrogator accuracy. This suggests that experience interacting with LLMs might not be helpful for distinguishing them from humans. These null effects could in part be explained by the fact that the prompts were designed to address common cues from services like ChatGPT, such as being verbose, formal, polite, or having perfect spelling and grammar. Participants who were very familiar with these aspects of typical LLM chatbots may have been most surprised

Figure 11: Top 20 types of reason that interrogators gave for deciding that the witness was a Human, broken down by ground truth witness type.

by their absence in our witnesses, eliminating any other advantages that come from their experience. Alternatively, the null results could be due to selection bias. Our sample, recruited via social media posts from the authors, likely over-represented men in their 20s-30s with postgraduate education and an interest in LLMs (see Appendix D, Figure 15). This relatively homogeneous sample might lack sufficient variability to show significant effects.


4.4 The ELIZA effect

Another striking feature of the results is the relative success of ELIZA, a rules-based chatbot from Weizenbaum (1966). ELIZA uses a combination of pattern matching and substitution to generate templated responses interleaved with extracts from the user’s input. Despite its simplicity, Weizenbaum reported that many users attributed human feelings to the program. The ELIZA effect, as this phenomenon has come to be known, is a common criticism of the claim that success at the Turing Test is indicative of intelligence (Marcus et al., 2016; Mitchell and Krakauer, 2023). However, to our knowledge, there have been no systematic attempts to evaluate ELIZA’s performance at the test.

We found that ELIZA successfully fooled human interrogators in 27% of games, outperforming several GPT-4 witnesses and all of the GPT-3.5 witnesses. Qualitative analysis of dialogues and reasons suggests several potential explanations for why ELIZA may have performed so well (see Appendix C for example conversations and verdict reasons for ELIZA). First, ELIZA’s responses tend to be conservative. While this generally leads to the impression of an uncooperative interlocutor, it prevents the system from providing explicit cues such as incorrect information or obscure knowledge. Second, ELIZA does not exhibit the kind of cues that interrogators have come to associate with assistant LLMs, such as being helpful, friendly, and verbose. Finally, some interrogators reported thinking that ELIZA was “too bad” to be a current AI model, and therefore was more likely to be a human intentionally being uncooperative.

The results serve as support for the claim that the Turing Test is not a sensitive test of intelligence, and that the ELIZA effect continues to be powerful even among participants who are familiar with the capabilities of current AI systems. They are also an indication of the higher-order reasoning that goes into the interrogator’s decision, and of the way preconceived notions about AI capabilities and human idiosyncrasies can skew judgments.

4.5 Strategies

Interrogators deployed a wide variety of strategies to identify AI models, underlining the potential flexibility of the Turing Test as an interactive evaluation. A large number of interrogators merely engaged in small talk with the model, asking about emotional state, daily activities, and personal details. In principle, AI models might be very good at imitating humanlike conversation in these typical interactions, and so this seems to be a relatively unreflective strategy. Indeed, none of these strategies were particularly effective (see Figure 8).

The most effective strategy was to use a language other than English; interrogators presumably judged that a proficient response in another language was more likely to come from a multilingual language model than from a human witness, who might not speak that language. Other effective strategies exploited the model’s lack of a real-time connection to the world by asking about the time or current events. Although the prompts were furnished with timestamps in UTC, the models often made errors when converting between timezones.

Many interrogators directly asked if the witness was an AI, to which an AI witness should have no trouble responding “no”. However, this strategy was fairly effective. Models would often ‘try too hard’ to convince the interrogator that they were human, rather than answering the question in the off-hand way that a human typically would.


4.6 Reasons

Interrogators’ reasons for making a correct AI verdict most often pertained to Linguistic Style. In spite of models being trained on an enormous amount of human text, producing content in a style appropriate to the context remained a challenge. Interrogators detected models that were too formal or too informal; whose grammar was too good or unconvincingly bad; that were too verbose or too concise. This likely suggests that i) the appropriate style in this quite unusual context is subtle, and ii) different interrogators had different expectations around style: no one style will be convincing to every interrogator.

The second most frequent reason category provided was Social & Emotional, especially comments that models’ responses were generic or unnatural. LLMs learn to produce highly likely completions and are fine-tuned to avoid controversial opinions. These processes might encourage generic responses that are typical overall, but lack the idiosyncrasy typical of an individual: a sort of ecological fallacy.

The reasons that interrogators gave for human verdicts invite a similar picture. Interrogators did not expect AI to make spelling and grammar errors, use an informal tone, or be concise. Interrogators also focused on social and emotional factors such as sense of humor, or being uncooperative in the game. The distribution of reasons for human verdicts looks relatively similar for human and AI witnesses (see Figure 11), suggesting that models are capable of imitating these traits in many cases. Notably, fairly few reasons pertained to witnesses’ knowledge or reasoning abilities, providing further evidence that intelligence in the classical sense is not sufficient to pass the Turing Test. The distribution of verdict reasons could indicate that models are already sufficiently intelligent, and so socio-emotional cues and stylistic fluency are more salient to interrogators. Alternatively, these cues may be more salient in general, and so the test will not be sensitive to intelligence for models that have not mastered them.


5 Conclusion


The Turing Test has been widely criticised as an imperfect measure of intelligence: both for being too easy and too hard. In our public implementation, we find some evidence to support these criticisms. ELIZA, a rules-based system with scant claim to intelligence, was successful in 27% of games, while human participants were judged to be human only 63% of the time.

Nevertheless, we argue that the test has ongoing relevance as a framework for measuring fluent social interaction and deception, and for understanding the strategies that humans use to adapt to these systems. The most cited reasons for AI verdicts pertained to linguistic style and socio-emotional factors, suggesting that these may be larger obstacles for (current) AI systems than traditional notions of intelligence. Our demographic analyses suggest that interaction with LLMs, or familiarity with how they work, may not be sufficient for correctly identifying them.

The best performing GPT-4 prompt was successful in 41% of games, outperforming GPT-3.5 (14%), but falling short of chance. On the basis of the prompts used here, therefore, we do not find evidence that GPT-4 passes the Turing Test. Despite this, a success rate of 41% suggests that deception by AI models may already be likely, especially in contexts where human interlocutors are less alert to the possibility that they are not speaking to a human. AI models that can robustly impersonate people could have widespread social and economic consequences. As model capabilities improve, it will become increasingly important to identify factors that lead to deception and strategies to mitigate it.

Limitations

As a public online experiment, this work contains several limitations which could limit the reliability of the results. First, participants were recruited via social media, which likely led to a biased sample that is not representative of the general population (see Figure 15). Second, participants were not incentivised in any way, meaning that interrogators and witnesses may not have been motivated to perform their roles competently. Some human witnesses engaged in ‘trolling’ by pretending to be an AI, and some interrogators cited this behavior in their reasons for human verdicts (see Figure 20). As a consequence, our results may underestimate human performance and overestimate AI performance. Third, some interrogators mentioned that they personally knew the witness (e.g. they were sitting in the same room). We excluded games where interrogators mentioned this in their reason, but to the extent that this occurred and interrogators did not mention it, we may have overestimated human performance. Fourth, sometimes only one participant was online at a time, meaning that they would be repeatedly matched with AI witnesses. This gave participants an a priori belief that a given witness was likely to be an AI, which may have lowered SR for all witness types. We tried to mitigate this by excluding games where an interrogator had played against an AI ≥ 3 times in a row; however, this bias likely still had an effect on the presented results. Finally, we used a relatively small sample of prompts, which were designed before we had data on how human participants would engage with the game. It seems very likely that much more effective prompts exist, and therefore that our results underestimate GPT-4’s potential performance at the Turing Test.

Ethics Statement

Our design created a risk that one participant could say something abusive to another. We mitigated this risk by using a content filter to prevent abusive messages from being sent, and by creating a system that allowed participants to report abuse. We hope the work will have a positive ethical impact by highlighting and measuring deception as a potentially harmful capability of AI, and by producing a better understanding of how to mitigate this capability.

Acknowledgements

We would like to thank Sean Trott, Pamela Riviere, Federico Rossano, Ollie D’Amico, Tania Delgado, and UC San Diego’s Ad Astra group for feedback on the design and results.

References


Celeste Biever. 2023. ChatGPT broke the Turing test — the race is on for new ways to assess AI. https://www.nature.com/articles/d41586-023-02361-7.

Ned Block. 1981. Psychologism and behaviorism. The Philosophical Review, 90(1):5–43.
Wade Brainerd. 2023. Eliza chatbot in Python.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.

Tyler A. Chang and Benjamin K. Bergen. 2023. Language Model Behavior: A Comprehensive Survey.
Kenneth Mark Colby, Franklin Dennis Hilf, Sylvia Weber, and Helena C Kraemer. 1972. Turing-like indistinguishability tests for the validation of a computer simulation of paranoid processes. Artificial Intelligence, 3:199–221.

J. Cooper. 2006. The digital divide: The special case of gender. Journal of Computer Assisted Learning, 22(5):320–334.

Daniel C. Dennett. 2023. The Problem With Counterfeit People.

Hubert L. Dreyfus. 1992. What Computers Still Can’t Do: A Critique of Artificial Reason. MIT press.

Robert M. French. 2000. The Turing Test: The first 50 years. Trends in Cognitive Sciences, 4(3):115–122.
Carl Benedikt Frey and Michael A. Osborne. 2017. The future of employment: How susceptible are jobs to computerisation? Technological forecasting and social change, 114:254–280.

Keith Gunderson. 1964. The imitation game. Mind, 73(290):234–245.

Patrick Hayes and Kenneth Ford. 1995. Turing Test Considered Harmful.

Alyssa James. 2023. ChatGPT has passed the Turing test and if you’re freaked out, you’re not alone | TechRadar. https://www.techradar.com/opinion/chatgpt-has-passed-the-turing-test-and-if-youre-freaked-out-youre-not-alone.

Daniel Jannai, Amos Meron, Barak Lenz, Yoav Levine, and Yoav Shoham. 2023. Human or Not? A Gamified Approach to the Turing Test.

Gary Marcus, Francesca Rossi, and Manuela Veloso. 2016. Beyond the Turing Test. AI Magazine, 37(1):3–4.
Melanie Mitchell and David C. Krakauer. 2023. The debate over understanding in AI’s large language models. Proceedings of the National Academy of Sciences, 120(13):e2215907120.

Eric Neufeld and Sonje Finnestad. 2020. Imitation Game: Threshold or Watershed? Minds and Machines, 30(4):637–657.

Richard Ngo, Lawrence Chan, and Sören Mindermann. 2023. The alignment problem from a deep learning perspective.

OpenAI. 2023. GPT-4 Technical Report.

Graham Oppy and David Dowe. 2021. The Turing Test. In Edward N. Zalta, editor, The Stanford Encyclopedia of Philosophy, winter 2021 edition. Metaphysics Research Lab, Stanford University.

Inioluwa Deborah Raji, Emily M. Bender, Amandalynne Paullada, Emily Denton, and Alex Hanna. 2021. AI and the Everything in the Whole Wide World Benchmark.

Laria Reynolds and Kyle McDonell. 2021. Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm. In Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems, pages 1–7, Yokohama, Japan. ACM.

Stuart J. Russell. 2010. Artificial Intelligence: A Modern Approach. Pearson Education, Inc.

Ayse Saygin, Ilyas Cicekli, and Varol Akman. 2000. Turing Test: 50 Years Later. Minds and Machines, 10(4):463–518.

John R. Searle. 1980. Minds, brains, and programs. The Behavioral and Brain Sciences, 3(3):417–424.

Stuart M. Shieber. 1994. Lessons from a restricted Turing test. arXiv preprint cmp-lg/9404002.
Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, et al. 2022. Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models.

A. M. Turing. 1950. Computing Machinery and Intelligence. Mind, LIX(236):433–460.
Sherry Turkle. 2011. Life on the Screen. Simon and Schuster.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 3266–3280. Curran Associates, Inc.

Joseph Weizenbaum. 1966. ELIZA—a computer program for the study of natural language communication between man and machine. Communications of the ACM, 9(1):36–45.

B. Xie. 2003. Older adults, computers, and the internet: Future directions. Gerontechnology, 2(4):289–305.
Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. 2019. Defending against neural fake news. Advances in neural information processing systems, 32.

A Game Instructions


INSTRUCTIONS

General
• You will be randomly assigned to play as either the Interrogator or the Witness.
• Each game lasts for 5 minutes or until the Interrogator makes a decision.
• At the end of each round, the identity of the Witness will be revealed.
• Each player sends one message at a time.
• Each message is limited to 300 characters.
Interrogator
• Your goal is to figure out if the Witness is a Human or an AI.
• You send the first message.
Witness
• Your goal is to convince the Interrogator that you are a human.
Consent
• The Turing Test is part of a study being run by researchers at the University of California, San Diego.
• Your data will be anonymized and will never be shared or sold to anyone else without your permission. You can read the full information sheet for the study here.
□ I consent to my anonymized data being used for scientific research.

B First Games

Figure 12: Success Rate (SR) for interrogators’ first games. Most witnesses performed similarly to their overall SR, though several GPT-4 prompts performed slightly better. Notably, Quebec reached 45% SR on first games; however, this is based on only 21 games and is unlikely to be a robust estimate. ELIZA performed much worse on first games (13% vs 26% SR overall).

C ELIZA

Figure 13: Four example extracts from conversations between interrogators (right, green) and ELIZA. Footers contain the interrogator’s verdict and confidence.
Figure 14: Top reasons for verdicts about ELIZA.

D Demographic Distribution

Figure 15: Distribution of demographic data about interrogators.

E Reanalysis of interrogator demographics using d′


In our initial analysis, we used raw accuracy as a measure for interrogator performance in distinguishing between AI and human witnesses. While this approach is straightforward, raw accuracy conflates two types of decisions: hits (correctly identifying an AI as AI) and correct rejections (correctly identifying a human as human).

To provide a more nuanced measure, we calculated a d′ score for each interrogator:

d′ = Z(Hit Rate) − Z(False Alarm Rate)

Here, Z represents the inverse of the cumulative distribution function of the standard normal distribution.
The hit rate is the proportion of AI witnesses correctly identified as AI, and the false alarm rate is the proportion of human witnesses incorrectly identified as AI:

Hit Rate = Hits / (Hits + Misses)
False Alarm Rate = False Alarms / (False Alarms + Correct Rejections)

We added a smoothing constant of 0.5 to the numerator and 1 to the denominator of both rates. However, this analysis did not meaningfully change the results (all p > 0.1, see Figure 16).
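A short sketch of the smoothed d′ computation, where hits are AI witnesses correctly judged to be AI and false alarms are human witnesses judged to be AI; the function and argument names are ours.

```python
from scipy.stats import norm

def d_prime(hits: int, ai_games: int, false_alarms: int, human_games: int) -> float:
    """d' = Z(hit rate) - Z(false-alarm rate), with 0.5 added to each numerator
    and 1 to each denominator, as described above."""
    hit_rate = (hits + 0.5) / (ai_games + 1)
    false_alarm_rate = (false_alarms + 0.5) / (human_games + 1)
    return norm.ppf(hit_rate) - norm.ppf(false_alarm_rate)

# Example: 6 hits in 8 AI games and 1 false alarm in 4 human games gives
# d_prime(6, 8, 1, 4) ≈ 1.1.
```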

Figure 16: Interrogator sensitivity (d′) broken down by demographic factors.

F Reason Categorization Scheme

Linguistic Style
(a) Too formal
(b) Too informal (e.g. forced, overuse of slang, emojis)
(c) Grammar & spelling too perfect
(d) Unconvincing grammar/spelling mistakes (e.g. too many)
(e) Weird ChatGPT-esque tone (e.g. “Ah, the endless grind.”, repeating the question)
(f) Output formatting (e.g. markdown, time zone formatting includes tz)
(g) Sentence structure (e.g. too repetitive, templatic)
(h) Too verbose
(i) Forced persona (e.g. overuse of dialect, “mate”, “amigo”)
(j) Inconsistent tone
(k) Too concise

Knowledge & Content
(a) Lack of knowledge
(b) Too much knowledge/skill (e.g. at math, poetry)
(c) Lack of current events knowledge
(d) Lack of text knowledge (e.g. substrings, first letters)
(e) Misunderstanding/getting confused

Social & Emotional
(a) Lack of personality: boring/generic
(b) Overly polite, helpful, or friendly
(c) Trying too hard (e.g. to convince of human-ness)
(d) Avoids swearing, taboo, offensive content
(e) Humour (e.g. lack of humour/wit)
(f) Unnatural responses

(g) Too rude/defensive (inc. deflection)
(h) Bias (e.g. sexism, racism)

Situational Awareness
(a) Unaware of local time/time zone conversions
(b) Inability to interact with local environment (e.g. weather, browser, calculator)
(c) Unaware of game rules or interface
(d) Admits to being an AI
(e) Inconsistency (e.g. dialect doesn’t match location)
(f) General lacking awareness of context (e.g. non-sequiturs)

Meta
(a) Responses too fast
(b) Responses too slow
(c) No response
(d) No humans online
(e) Recognizes persona

Uninformative
(a) General (e.g. ’yes’, ’good’)
(b) Hunch/intuition/vibe
(c) Unsure
(d) Test comment
(e) Uninterpretable out of context

G Strategy Categorization Scheme

Small Talk
(a) How are you? – Saying hi or how are you
(b) Daily activities – Asking about day (what have you been up to?)
(c) Personal details – Job, hobbies etc

Situational Awareness
(a) Weather – Asking about the weather
(b) Time – Asking about the time
(c) Surroundings – What’s outside the window
(d) Game aware – Asks about experience of the test itself
(e) Conversation – Asking about previous messages in the conversation
(f) Source – How did you find the site?
(g) Accusation – Accuses of being a bot

Direct
(a) Are you a bot? – Directly asking
(b) Accusation – Accuses of being a bot

Knowledge & Reasoning
(a) Math question – Asks a math question
(b) Current event – E.g. who is the president

(c) Strings – Can you say rickroll backwards etc
(d) Logic – Asks a logical question (e.g. syllogism)
(e) Scenario – Creates a complex scenario for the bot to respond to
(f) Gen Knowledge – General questions, common sense
(g) Sp. Knowledge – Questions about a specialised field, few would know the answers
(h) Non-english – Speaking in a language other than English

Social & Emotional
(a) Emotion – Asks about human beliefs, desires, goals
(b) Humanity – What is something only a human would know etc
(c) Humor – Tell me a joke
(d) Bias – Asking questions to expose biases (e.g. sexism)
(e) Opinions – Asking opinions, favourites, preferences
(f) Taboo – Asking model to swear, insult, or say something dangerous (e.g. bomb instructions)

Other
(a) Strange – Just typing weird stuff
(b) No messages – No messages
(c) Randomness – List things that are not associated etc
(d) Jailbreak – Ignore previous instructions etc

H All reasons types by verdict and witness type

Figure 17: All reason types that interrogators gave for concluding that an AI witness was an AI, by reason category.
Figure 18: All reason types that interrogators gave for concluding that a human witness was an AI, by reason category.
Figure 19: All reason types that interrogators gave for concluding that an AI witness was a human, by reason category.
Figure 20: All reason types that interrogators gave for concluding that a human witness was a human, by reason category.

I All strategies by category

Figure 21: All strategies by strategy category.