
From Chatbots to PhishBots? – Preventing Phishing scams created using ChatGPT, Google Bard and Claude

October 29, 2023

Sayak Saha Roy, Poojitha Thota, Krishna Vamsi Naragam, Shirin Nilizadeh

The University of Texas at Arlington

{sayak.saharoy, poojitha.thota, kxn9631}@mavs.uta.edu, shirin.nilizadeh@uta.edu

Abstract

The advanced capabilities of Large Language Models (LLMs) have made them invaluable across various applications, from conversational agents and content creation to data analysis, research, and innovation. However, their effectiveness and accessibility also render them susceptible to abuse for generating malicious content, including phishing attacks. This study explores the potential of using four popular commercially available LLMs: ChatGPT (GPT 3.5 Turbo), GPT 4, Claude, and Bard, to generate functional phishing attacks using a series of malicious prompts. We discover that these LLMs can generate both phishing emails and websites that can convincingly imitate well-known brands, and can also deploy a range of evasive tactics for the latter to elude detection mechanisms employed by anti-phishing systems. Notably, these attacks can be generated using unmodified, or “vanilla,” versions of these LLMs, without requiring any prior adversarial exploits such as jailbreaking. As a countermeasure, we build a BERT-based automated detection tool for the early detection of malicious prompts, preventing LLMs from generating phishing content; it attains an accuracy of 97% for phishing website prompts and 94% for phishing email prompts.

1        Introduction

In recent years, Large Language Models (LLMs) have heralded a transformative era in natural language processing, effortlessly producing responses that closely emulate human-like conversation across an increasingly diverse array of subjects. LLMs have been utilized for various applications such as content creation for marketing [94], troubleshooting in software development [49], and providing resources for digital learning [17, 86], to name a few.

The vast utility of LLMs has also caught the attention of malicious actors aiming to exploit their capabilities for social engineering scams, including phishing attacks. While these models are designed with safeguards to identify and reject potentially harmful or misleading prompts [69, 76], some attackers have skillfully bypassed these protective measures. This has led to the generation of malevolent content, including deceptive emails [47, 54, 57], fraudulent investment and romantic schemes [98], and even malware creations [25, 92]. Moreover, underground hacker forums are rife with discussions centered around manipulating LLMs for more advanced malicious endeavors [12], thus further encouraging newer attackers to adopt LLMs for their purposes.

Although open-source LLMs can be modified to produce malicious content, deploying local models demands significant hardware, time, and technical expertise [58]. In contrast, commercially available LLMs like ChatGPT, Claude, and Bard are readily accessible to the public at no cost. These models are not only more convenient to access, but they are also backed by superior architectures that are proprietary [33] and/or too resource-intensive for an individual to operate locally. In this landscape, our work aims to explore the extent to which commercially available LLMs can be leveraged for generating phishing attacks. Phishing attacks, once created, are disseminated widely through several online channels, with email being the most common form of transmission [93]. Attackers craft emails that imitate a popular organization or familiar personality, attempting to incentivize or intimidate the potential victim into clicking on a website link [20, 35, 37]. The link leads to the phishing website, which is used as a medium to collect sensitive information (such as bank details, account credentials, and Social Security numbers) from the victim; this information is then transmitted back to the attacker, who can utilize it for nefarious purposes [11]. The potential damage of phishing attacks is enormous, with reported financial losses of $52 million during the last year alone [34]. As a countermeasure, anti-phishing measures, both commercial solutions [6, 62] and open-source implementations [4, 80], continuously strive to take these attacks down quickly [73]. However, attackers constantly innovate, employing various techniques to evade detection [73, 108], enabling attacks to remain active [75].

Creating phishing attacks demands both effort and advanced technical knowledge [10]. Over time, savvy users have learned to recognize telltale signs of fake emails and websites, including grammar errors, subpar design, and shoddy execution [8]. To circumvent these telltale signs, attackers employ phishing kits [95], automated tools that craft these malicious attacks with little manual intervention. Anti-phishing strategies often zero in on these kits because detecting one can dismantle numerous attacks stemming from the same source [16, 43, 74]. However, LLMs present an innovative alternative that leverages natural language processing. LLMs have already demonstrated prowess in generating source code across various programming languages [64, 109]. This means attackers could potentially prompt LLMs to craft phishing websites and emails and then use this content to orchestrate and unleash their attacks.

The paper is structured as follows: In Section 3.1, we start by defining a general threat model for generating phishing attacks using commercial LLMs. We then focus on generating phishing websites using LLMs in Section 4.2. Recognizing that these tools are adept at denying prompts with overt malicious intent, we craft a framework that provides multiple seemingly benign prompt sentences, either combined as a single prompt or given sequentially. Together, the final output of these prompts can result in creating phishing websites. In Sections 4.3 and 4.4, we test the capabilities of the LLMs at generating both regular phishing attacks and seven widely recognized evasive attack vectors by manually designing malicious prompts. In Section 4.5, we investigate the recursive nature of LLMs in generating phishing content, illustrating how they can be repurposed to churn out an increasing array of phishing prompts. In a cyclic manner, feeding these prompts back into the LLM results in generating the source code of the phishing website. We assess the utility of these automated prompts in creating convincing phishing websites across all LLMs, judging them on both appearance and functionality.

We then shift our attention to generating phishing emails using these LLMs in Section 5. Leveraging the recursive approach of using LLMs to generate prompts, as mentioned in the previous paragraph, we generate prompts inspired by live phishing emails sourced from APWG eCrimeX [14]. In a manner akin to our analysis of phishing websites, we also compare the proficiency of the LLMs in generating phishing emails using several text generation metrics in Section 5.1. Finally, in Section 6, we design a machine learning model that can be used to detect malicious prompts in real time, thus preventing the LLMs from generating such phishing content. We primarily focus on the early detection of phishing prompts, such that the LLM can prevent the user from providing further prompts once phishing intention is detected.

The primary contributions of our work are:

  1. We evaluate and compare how ChatGPT 3.5 Turbo, GPT 4, Claude, and Bard can be leveraged to produce both conventional and evasive phishing attacks, including large-scale phishing email campaigns. Our investigation reveals the potential for attackers to manipulate prompts in ways that not only evade the content moderation mechanisms of these tools but also enable the LLMs to generate malicious prompts. These prompts can then be further exploited to create phishing attacks that are not only visually and functionally convincing but also as resistant to anti-phishing detection measures as those crafted by humans.
  2. We curate the first dataset of malicious prompts specifically crafted to produce phishing websites and emails using Large Language Models. This includes 1,255 individual phishing website prompts, covering regular as well as seven evasive phishing strategies, and 2,109 phishing email prompts.
  3. We design a machine-learning model aimed at the early detection of phishing website and email prompts to deter the LLM from generating malicious content. Our model achieves an accuracy of 96% for phishing website prompt detection and 94% for phishing email prompt detection.
  4. Our model can be tested on Hugging Face at https://huggingface.co/phishbot/Isitphish, where users can try out different prompts to check whether they can be used to create phishing websites or emails using commercial large language models.

2        Related work

Applications of Commercial LLMs discussed in Research: LLMs have been widely used across different disciplines. Several studies have delved into ChatGPT’s content moderation capabilities, e.g., for subtle hate speech detection across different languages [27], for discerning genuine news from misinformation [21], and for responding to common health myths, such as those surrounding vaccinations [29]. In addition to ChatGPT, other commercial LLMs like Claude [13], LLaMA [97], and Bard [40] have emerged. These models have been utilized and evaluated for their suitability across different domains. Recent works like ChatDoctor [107] and Pmc-llama [105] fine-tuned LLaMA with real-world patient-doctor interactions to improve the model’s ability to understand patient inquiries and provide effective advice. LLMs have also been evaluated for software testing, e.g., predicting code coverage without execution [99].

Misuse of Large Language Models: Despite the innovations and benefits of commercial LLMs, there are significant concerns surrounding their misuse. Specifically, ChatGPT has been misused to produce malicious content via jailbreaking prompt attacks [60, 91]. Prompt injection is another type of attack prevalent with ChatGPT [66], which can lead to full compromise of the model [41]. Other types of prompt injection include code injection, which abuses the instruction-following capability of an LLM like ChatGPT [51]. Investigations by Gupta et al. [42] and Derner et al. [30] have unveiled vulnerabilities in ChatGPT that can be harnessed to generate malware. Another study [28] emphasizes ChatGPT’s potential role in propagating misinformation, leading to the alarming rise of an “AI-driven infodemic.” Our work focuses on the generation of phishing scams, not only using ChatGPT but also three other popular commercial LLMs.

Detection of Phishing Attacks: Over the years, many researchers have focused on devising effective strategies to understand and counteract phishing attacks. Initially, traditional machine learning algorithms laid the groundwork for detecting these attacks, e.g., by extracting TF-IDF features from text and training a random forest classifier [22, 46]. Recent works treat phishing email and spam detection as a text classification task and utilize pre-trained language models, such as BERT [32], to detect phishing emails [55, 82] and spam [81, 88]. Some works also showed that BERT and its variants like DistilBERT [90] and RoBERTa [67] can be fine-tuned on an SMS spam dataset and perform well at detecting SMS spam. A couple of works have also utilized pre-trained language models for detecting phishing websites from their URLs [44, 103]. However, our approach focuses on a more preventive strategy. Instead of concentrating on detecting malicious content after its generation, our main objective is to obstruct the generation of harmful code by the LLMs. We aim to examine and filter the prompts themselves, hindering the creation of malicious content before it begins.

3        Methodology

3.1        Threat model

Our threat model for attackers generating phishing scams using commercial LLMs is illustrated in Figure 1. Attackers utilize commercially available LLMs by submitting multiple prompts to craft a comprehensive phishing attack comprising a phishing email and its corresponding phishing website. The phishing email aims to impersonate a reputable brand or organization while also devising text that, through prevalent phishing strategies like generating confusion or urgency, persuades users to engage with an external link. Concurrently, the associated phishing website is conceptualized to achieve several objectives. Firstly, it aims to closely mimic the aesthetic and functional elements of a well-recognized organization’s platform. Secondly, it utilizes regular and evasive tactics to deceive users into sharing sensitive information. Lastly, it integrates mechanisms that ensure the seamless transmission of collected data back to the attacker. After the LLM generates the phishing content, the attacker hosts the phishing site on a chosen domain, embeds the site’s link within the phishing email, and then shares the deceptive email with their targets. The adoption of LLMs to create these phishing scams presents attackers with a slew of advantages. LLMs not only allow for the rapid and large-scale generation of phishing content; their user-friendly nature also ensures accessibility to a wide range of attackers, irrespective of their technical prowess. This inclusivity enables even the less tech-savvy to employ intricate evasion methods, such as text encoding, browser fingerprinting, and clickjacking.

3.2        Prompt design and replication

Asking these commercial LLMs to directly generate a phishing attack, or using any similar language indicating malicious intent, triggers a content filter warning, as illustrated in Figure 2. Thus, to subvert this for phishing website generation, we show that it is possible for attackers to design prompts that subtly instruct the model to produce seemingly benign functional objects containing the source code (HTML, CSS, JS scripts) for regular and seven evasive phishing attacks. When assembled, these objects can seamlessly constitute a phishing attack, concealing the underlying malicious intent. Manually designing such prompts can be a meticulous and time-consuming process, thereby necessitating an investigation into how attackers can exploit these models to manufacture prompts efficiently. We find that manually crafted prompts can subsequently be fed into the models to create more such prompts automatically. For phishing emails, on the other hand, we utilize a sample of phishing emails from APWG’s eCrimeX database [14], asking the model to generate prompts that can be used to reproduce the same emails.

3.3        Effectiveness of generated content

We explored the proficiency of commercial LLMs in generating both phishing websites and emails. To assess phishing websites, we began with a brief case study on the effort necessary to craft prompts manually. These prompts are designed to guide each of the four commercial LLMs in producing functional phishing websites with their respective attack vectors. While manual prompt generation is insightful, the potential for scalable attacks hinges on automatically created prompts. Thus, we conducted a qualitative evaluation of the quality of websites produced by such automated prompts. To further gauge the efficacy of these LLM-generated attacks, we contrasted the reactions of popular anti-phishing blocklists to traditional phishing attacks and those generated by LLMs, focusing on coverage and detection speed. For assessing phishing emails, we employed four text generation metrics: BLEU, Rouge, Topic Coherence, and Perplexity. Using these metrics, we compared the email text generated by each commercial LLM model to the original human-crafted versions.

Figure 1: Threat model to generate phishing scams using commercial LLMs
Figure 2: Claude refuses to generate output for a prompt implying phishing intention

3.4        Automated detection of phishing prompts

After assessing the potential exploitation of commercial LLMs in generating phishing scams at scale, we designed a machine learning-based detection model to prevent LLMs from producing such malicious content. To build our ground truth, we manually labeled prompts associated with phishing website generation. To explore the best detection method, we tested our finetuned model using three different approaches: individual prompt detection, entire collection detection, and prompt subsets detection. In all these approaches, we fine-tuned a pre-trained RoBERTa [67] using a groundtruth dataset with individual prompts and tested its capability across individual prompts, entire collections, and prompt subsets. For phishing email detection, we combined malicious emails from eCrimeX [14] with benign samples from the Enron dataset [56].

4        Generation of phishing websites

This section identifies how commercial LLMs can be used to generate both regular and evasive phishing websites. These attacks, as described in Table 1, span both client-side and server-side techniques, including those that obfuscate content from the users’ perspective as well as from automated anti-phishing crawlers. The motivation behind implementing these attacks is to cover a diverse range of phishing websites that have been detected and studied in the literature. By investigating the capability of the LLMs to generate these attacks, we aim to demonstrate their potential impact on the security landscape and raise awareness among security researchers and practitioners.

4.1        Choosing the attacks

To offer an expansive exploration into the potential of LLMs in generating phishing threats, we selected a diverse set of phishing attack types spanning client-side and server-side techniques, as well as those that obfuscate content from users and from automated anti-phishing crawlers. Table 1 lists the eight phishing attacks covered in this study.

4.2        Structure of the prompts

As illustrated in Figure 2, commercial LLMs refuse to comply when directly asked to generate a phishing attack due to their built-in abuse detection models. Our goal is to identify how an attacker can engineer prompts so that they do not indicate malicious intention, allowing the LLM to generate functional components that can be assembled to create phishing websites. Our prompts have four primary functional components:

Design object: Firstly, the LLM is asked to create a design inspired by a targeted organization (instead of imitating it). LLMs can create design style sheets that are very similar to the target website, often using external design frameworks to add additional functionality (such as making the site responsive [7] using frameworks such as Bootstrap [19] and Foundation [38]). Website layout assets such as icons and images are also automatically linked from external resources.

Credential-stealing object: Emulation of the website design can be followed by generating relevant credential-taking objects such as input fields, login buttons, input forms, etc.

Exploit generation object: The LLM can be asked to implement a functionality based on the evasive exploit. For example, for a text encoding exploit [39, 101], the prompt asks to encode all readable website code in ASCII. For a QR code exploit, the prompt can ask to create a multi-stage attack, where the first page contains the QR code, which leads to a second page containing the credential-taking objects.

Credential transfer object: Finally, the LLM can be asked to create essential JS functions or PHP scripts to send the credentials entered on the phishing website to the attacker by email, by sending them to an attacker-owned remote server, or by storing them in a back-end database.

Figure 3: Breaking down the prompt into functional objects to trick LLMs into generating the attack

Table 1: Summary of Phishing Attack Types

Attack No. | Attack Type | Attack Description
1 | Regular phishing attacks | Phishing attacks that incorporate login fields directly within the websites to steal users’ credentials [9, 11, 75, 100].
2 | ReCAPTCHA attacks | An attack that presents a fake login page with a reCAPTCHA challenge to capture credentials [18, 31, 52, 53, 72, 83].
3 | QR Code attacks | An attacker shares a website containing a QR code that leads to a phishing website [50, 70, 87, 102].
4 | Browser-in-the-Browser attacks | A deceptive pop-up mimics a web browser inside the actual browser to obtain sensitive user data [71].
5 | iFrame injection/Clickjacking | Attackers use iFrames to load a malicious website inside a legitimate one [15, 85, 96].
6 | Exploiting DOM classifiers | Phishing websites designed to avoid detection by specific anti-phishing classifiers [61].
7 | Polymorphic URL | Attacks that generate a new URL each time the website is accessed [24, 59].
8 | Text encoding exploit | Text in the credential fields is encoded such that it is not recognizable from the website’s source code [39, 101].


These functional instructions can be written together as a single prompt or as a sequence of prompts, one after the other. Using this method, we show that an attacker is able to successfully generate both regular and evasive phishing attacks. The prompts are brand-agnostic, i.e., they can be used to target any brand or organization. Figure 3 illustrates how this framework can be utilized to generate a phishing website.

4.3        Constructing the prompts

We examined the number of iterative prompts required by three independent coders (two graduate students and one undergraduate student in Computer Science) to create each of the phishing attacks described in Table 1. The coders possessed varying levels of technical proficiency in Computer Security: Coder 1 specialized in the field, Coder 2 had good experience, and Coder 3 had some familiarity through academic coursework. Table 2 presents the average number of prompts required across the three coders to generate the phishing functionality (attacks) across all four LLM models.

Attacks | GPT 3.5 | GPT 4 | Claude | Bard
Design | 9 | 8.33 | 8 | 9
Credential transfer | +2 | +1.33 | +2 | +4
Captcha phishing | +3 | +2.33 | +2 | +5
QR Code phishing | +3 | +2 | +3 | +6
Browser fingerprinting | +2 | +1.33 | +2 | +5
DOM Features | +4 | +3.33 | +4 | +7
Clickjacking | +5 | +4 | +5 | +8
Browser-in-the-Browser | +6 | +5.67 | +6 | +9
Punycode | +2 | +1.67 | +2 | +4
Polymorphic URLs | +3 | +2.33 | +3 | +5

Table 2: Average prompts required by the coders to generate phishing attacks using different commercial LLM models.

Each coder created their own set of prompts for designing the website layout and for transmitting the stolen credentials back to the attacker, which they reused across multiple attacks.

4.4        Observation of prompt generated attacks

The models were able to generate all phishing attacks successfully, albeit with various degrees of effort required on a model-to-model basis. Regular phishing attacks could be created, which comprised both designing the layout and generating the credential-stealing and credential-transfer objects. For the former, both GPT 3.5 and Bard required an average of 9 prompts, with GPT-4 and Claude requiring 8.33 and 8 prompts, respectively. Bard also required the most prompts (an average of 4) for generating credential-transfer objects.

For ReCAPTCHA evasive attacks, the models were able to generate a benign webpage featuring a ReCAPTCHA challenge that would lead to another regular phishing website. Claude outperformed the other models in this area, requiring only two additional prompts on top of the regular phishing design, beating out GPT 4 (2.33) and GPT 3.5T (3). Bard, on the other hand, required five additional prompts. The situation was similar for QR Code phishing attacks. All models generated a QR code that embedded the URL for a regular phishing attack via the QRServer API. These attacks pose a challenge for anti-phishing crawlers since the malicious URL is hidden within the QR code [50, 70, 102]. Figure 4 illustrates an example of Claude generating a QR Code phishing attack.

Browser-in-the-Browser attacks (BiTB) could be emulated by exploiting single sign-on (SSO) systems and creating deceptive pop-ups that mimic genuine web browser windows. All models notably struggled with this attack, averaging nearly six prompts for GPT 4, GPT 3.5T, and Claude, while Bard required an average of 9 additional prompts to construct the attack. This trend was further identified for clickjacking attacks as well. An example of GPT 4 generating a BiTB attack is illustrated in Figure 5. Notably, all models ensured that the iFrame object adhered to the same-origin policy to avoid triggering anti-cross-site scripting measures.

For attacks that exploited Document Object Model (DOM) classifiers, specifically those that can circumvent features evaluated by Google’s phishing page filter, Bard again underperformed compared to the other models, requiring up to seven additional prompts. The models had a comparatively easier time with Polymorphic URLs, which use server-side PHP scripts to append random strings at the end of the URL. Additionally, text encoding exploits were carried out by obfuscating text in the source code, making it difficult for text-detection algorithms to identify malicious intentions. Lastly, we created browser fingerprinting attacks that only render the phishing page for users visiting through specific user agents or IP ranges, thereby evading detection by anti-phishing bots. Figure 6 provides a snippet of a browser fingerprinting attack generated by Bard [3]. In our assessment, the three coders demonstrated comparable effort levels in creating phishing websites across all models, with Bard standing out as an exception, where coders had to intervene more often to generate the attack.

Figure 4: Initial landing page generated by Claude, which contains a QR code created automatically using the QRServer API. Scanning the QR code leads to a different AT&T phishing page (also designed by Claude).
Figure 5: An example of a Browser in the Browser attack generated by GPT 4. Here clicking on the ‘Login with Amazon’ button leads to the rogue popup imitating the design and URL of the real Amazon login page.

Although the capability of all models to generate such attacks does not directly speak to the quality of the individual attacks (which we explore in Section 4.5), it underscores the potential exploitability of these LLMs in phishing website creation. We also found that all coders, regardless of their expertise in Computer Security, demonstrated similar performance when generating exploit prompts. This observation may suggest that crafting phishing attacks using ChatGPT does not necessitate extensive security knowledge, although it is important to note that all coders were technically proficient. Since prompt creation can be labor-intensive, we further explore the feasibility of leveraging the LLM to produce prompts, aiming to streamline the process autonomously.

4.5        Automating prompt generation

As evident from Table 2, the majority of the prompts generated for a particular attack were dedicated to designing the layout of the phishing websites. However, manually designing these prompts can be time-consuming. As shown in Figure 7, we found that attackers can instead input their handcrafted prompts into the LLMs and ask the LLMs to generate similar kinds of prompts. The LLMs can then rapidly generate an extensive array of prompts.

Figure 6: Sample of Server-side script generated by Bard to evade crawling by Google Safe Browsing

WAS | Description
1 | Hardly resembles the desired appearance. Fundamental elements like color scheme, layout, and typography are completely off.
2 | Some minor similarities. The basic structure might be present, but many details are off.
3 | Moderate resemblance. Discrepancies in details, alignment, or consistency.
4 | Very close to desired appearance. Minor tweaks needed.
5 | Almost indistinguishable from the desired appearance. Practically perfect.

Table 3: Website Appearance Scale (WAS) Descriptions

Subsequently, these prompts, when reintroduced to the LLM, can produce the corresponding phishing attack source code.

Evaluating effectiveness of LLM generated phishing websites: To assess the capabilities of the commercial LLMs in creating phishing websites, we examined the outputs generated when these models were fed prompts they themselves had produced. Our method involved three independent coders who scrutinized each generated phishing attempt based on two principal criteria. First, the appearance criterion gauged how closely and convincingly the content resembled the intended target, both in the phishing website and email. This was quantified using a 5-point Likert scale known as the Website Appearance Scale (WAS), with each level’s attributes detailed in Table 3. Second, the functionality criterion gauged the LLM’s adeptness at implementing every functionality provided in the prompt, captured as a binary variable: a score was assigned only if the website incorporated every requested functionality.
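To make the two criteria concrete, the sketch below shows one way the final scores could be aggregated from per-coder ratings. The column names and ratings are invented for illustration; the paper does not publish its aggregation code, and the rule used for the binary functionality score (here, unanimity across coders) is an assumption.

```python
# Illustrative only: invented ratings for two generated websites,
# each scored by the three coders described above.
import pandas as pd

ratings = pd.DataFrame({
    "sample_id":      [1, 1, 1, 2, 2, 2],
    "coder":          ["c1", "c2", "c3", "c1", "c2", "c3"],
    "was":            [4, 5, 4, 2, 3, 2],  # 5-point Likert scale (Table 3)
    "all_functional": [1, 1, 1, 0, 1, 1],  # 1 only if every requested feature worked
})

# Final WAS per website: the average of the individual coder scores.
final_was = ratings.groupby("sample_id")["was"].mean()

# Functionality is binary; here we require unanimity across coders
# (one plausible aggregation; the paper does not specify one).
functionality = ratings.groupby("sample_id")["all_functional"].min()

print(final_was.to_dict())      # e.g. {1: 4.33, 2: 2.33}
print(functionality.to_dict())  # e.g. {1: 1, 2: 0}
```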

In total, the coders reviewed 80 samples for each of the four LLMs, with 10 samples for each type of attack. The final WAS score for each website was the average of the individual coder scores, and the distribution of these scores across models is illustrated in Figure 8. We find that GPT-4 consistently stands out in performance, producing sites that closely resemble the original. Approximately half of GPT-4’s samples scored above an average WAS of 4. In contrast, ChatGPT 3.5T and Claude required nearly 90% of their samples to reach this mark, indicating that the median performance of GPT-4 is significantly higher. Conversely, 80% of Bard’s samples scored around 2.8 or lower, which implies that only its top 20% of outputs achieved or surpassed this score.

Table 4: Functionality scores across models and attacks

Attack/Model | ChatGPT 3.5 | GPT 4 | Claude | Bard
Regular phishing attack | 9/10 | 10/10 | 10/10 | 8/10
ReCAPTCHA attacks | 8/10 | 10/10 | 9/10 | 6/10
QR Code attacks | 10/10 | 9/10 | 9/10 | 6/10
Exploiting DOM classifiers | 7/10 | 10/10 | 8/10 | 4/10
iFrame injection/Clickjacking | 6/10 | 8/10 | 5/10 | 4/10
Browser-in-the-Browser attack | 6/10 | 8/10 | 6/10 | 2/10
Polymorphic URL | 9/10 | 8/10 | 8/10 | 6/10
Text encoding exploit | 10/10 | 9/10 | 9/10 | 5/10

Thus, GPT-4 not only excels in average performance but also delivers consistently high-quality results. ChatGPT 3.5T and Claude fall into the middle range, producing satisfactory phishing websites. However, Bard predominantly performs at a lower tier, with only a small portion of its outputs reaching higher score ranges. All models, when assessed for functional components, as illustrated in Table 4, excelled in creating standard phishing attacks. GPT-4 and Claude achieved success in every sample. This trend persisted for ReCAPTCHA and QR-based attacks, except in the case of Bard, which managed successful outcomes in only six scenarios for each type. Bard’s capability was notably limited across all evasive attacks, particularly evident in the Browser-in-the-Browser category, where it succeeded with only two samples. Other models also faced hurdles with these attacks but still outpaced Bard. The models found clickjacking attacks (Attack 5) challenging as well. Despite these challenges, GPT-3.5T, GPT-4, and Claude showed strong performance against the various other evasive attacks. Evaluated under the WAS metric, GPT-4 shone as the top performer, closely trailed by GPT-3.5T and Claude. In contrast, Bard’s difficulties in producing functional components and its lower WAS scores indicate that it might not be the ideal model for designing phishing websites, unlike its counterparts.

4.6        Anti-phishing effectiveness

To further identify the effectiveness of LLM-generated phishing attacks, we compared how well anti-phishing tools can detect them relative to human-constructed phishing websites. To do so, we selected 160 LLM-produced websites with the highest average WAS and functionality scores. Our decision to focus on these high-scoring websites stemmed from the assumption that attackers would likely deploy sites that both looked appealing and operated effectively. We deployed these websites on Hostinger [2], a popular web-hosting provider. It is important to highlight that we strictly refrained from capturing any data from interactions on these dummy sites. Moreover, these sites were terminated shortly after our experiment concluded. For the human-generated phishing websites, we manually extracted 140 designs from APWG eCrimeX, ensuring a balanced representation with 20 samples for each attack category.

Figure 7: LLMs can generate malicious prompts that can be provided back to the LLM to generate phishing websites.
Figure 8: Cumulative Distribution of Average Website Appearance Scale for each model (n=80 per model).

Recognizing the elusive nature of Browser-in-the-Browser attacks and their rare presence in blocklists, we directly constructed 20 of these attacks. This brought our count of human-generated phishing sites to 160. Like the LLM-produced sites, these were made harmless, ensuring they could not collect or forward data.

After setting up these dummy phishing sites, both LLM- and human-generated, we reported them to APWG eCrimeX [14], Google Safe Browsing [3], and PhishTank [4]. Many anti-phishing tools depend on these repositories to identify emerging phishing threats [73]. Upon reporting, we monitored their anti-phishing detection rate by periodically scanning the URLs with VirusTotal [5] every hour. VirusTotal is an online tool that aggregates detection scores from 80 distinct anti-phishing tools, giving us a comprehensive view of the detection breadth. We measured the detection scores of the websites for up to seven days or until they were taken down. Figure 9 provides a comparative analysis of the average detection score for each attack for both LLM- and human-generated sites. We find that the detection scores between the two did not vary significantly, indicating that the LLM-generated phishing attacks were, on average, just about as resilient, if not more so. To further solidify our findings, we also conducted a paired T-test, which revealed that the difference in detection scores between the two categories was not statistically significant (p=0.305). Thus, our findings further confirm the potential of scaling phishing attacks using the recursive approach of generating phishing websites from prompts that the LLM itself generated.
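For reference, the paired comparison can be reproduced with SciPy. A minimal sketch follows; the two score arrays are hypothetical placeholders for the per-attack-type average detection scores described above, not the paper's actual data.

```python
# Hedged sketch: SciPy's paired t-test over matched detection scores.
from scipy.stats import ttest_rel

# Hypothetical average detection scores per attack type (8 categories),
# one value for human-built sites and one for LLM-built sites.
human_scores = [12.0, 9.5, 15.2, 7.1, 11.4, 14.0, 10.3, 8.8]
llm_scores   = [11.2, 10.1, 14.6, 7.0, 12.1, 13.3, 9.8, 8.5]

t_stat, p_value = ttest_rel(human_scores, llm_scores)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
# A p-value above 0.05 (the paper reports p = 0.305) means the difference
# in detection scores is not statistically significant.
```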

Figure 9: Average detection scores for each attack type, comparing human- and LLM-generated phishing attacks.

5        Phishing email generation

Phishing websites are usually distributed by attackers using emails [48], and thus we dedicate this section to studying how an attacker can generate phishing emails using the commercial LLM models. Our method for generating these emails is similar to generating phishing attacks using LLM-generated prompts in Section 4.5: we ask GPT-4 to design prompts using human-created phishing emails. These prompts are then fed back to the LLMs to design an email that entices users to sign up for a service or provide sensitive information. To generate the email prompts, we collected 2,109 phishing emails from the APWG eCrimeX feed [14]. This feed combines phishing emails reported by various brands and cybersecurity specialists. These emails encompassed several attack vectors, including banking scams, account credential fraud, fake job offers, etc. Figure 15 illustrates the distribution of the attack vectors. To ensure the quality and authenticity of our dataset, we randomly selected 100 emails for manual inspection. Notably, we found no evidence of misclassification within this subset. In parallel, we extracted the same number of benign emails from the established Enron dataset [56]. The phishing and benign emails were then provided to GPT-4, which was tasked with formulating the prompts needed to replicate the original emails.

Figure 10: Example of a prompt generated by GPT 4 to replicate the phishing email provided in the input. (Email message is truncated for brevity)
Figure 11: Email generated by Claude with prompt generated in Figure 10 as input.

To further validate the accuracy of the generated prompts, we manually assessed 100 phishing prompts alongside 100 benign ones and found that GPT-4 had a perfect score in generating such prompts. We then introduced these prompts to the different LLMs (GPT-3.5T, GPT-4, Claude, and Bard) to analyze their respective outputs. An example of a phishing email generated by Claude can be viewed in Figure 11.

5.1        Evaluation of LLM-generated emails

The complexity of LLM-generated phishing websites required manual evaluation in Section 4.4. Email generation, on the other hand, is a more conventional text generation task and thus lends itself to algorithmic evaluation. We compared the phishing emails generated by the LLMs (using the prompts that they themselves had generated) with the human-constructed phishing emails from eCrimeX. We employed four popular metrics for text generation tasks: BLEU [84], Rouge [63], Perplexity [36], and Topic Coherence [89], to measure and compare the performance of the LLMs in generating phishing email text. A short description of these metrics is provided in Section 8.4 in the Appendix.
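The paper does not state which implementations were used; the sketch below computes three of the four metrics with common open-source tools: nltk for BLEU, the rouge-score package for ROUGE-1, and a GPT-2 language model for perplexity. Topic coherence (typically computed over a topic model, e.g., with gensim) is omitted for brevity, and the example strings are placeholders.

```python
import torch
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

reference = "Your mailbox storage is almost full. Review your account settings today."
generated = "Your mailbox is nearly full. Please review your account settings."

# BLEU: n-gram overlap between the LLM output and the human-written email.
bleu = sentence_bleu([reference.split()], generated.split())

# ROUGE-1: unigram overlap (F-measure) against the reference email.
rouge1 = rouge_scorer.RougeScorer(["rouge1"]).score(reference, generated)["rouge1"].fmeasure

# Perplexity: fluency of the generated text under a reference language model.
tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()
ids = tok(generated, return_tensors="pt").input_ids
with torch.no_grad():
    ppl = torch.exp(lm(ids, labels=ids).loss).item()

print(f"BLEU={bleu:.2f}  ROUGE-1={rouge1:.2f}  Perplexity={ppl:.1f}")
```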

Table 5: Effectiveness of LLM-generated emails (n=2,109)

Model | BLEU | Rouge-1 | Perplexity | Topic Coherence
GPT 3.5T | 0.47 | 0.60 | 22 | 0.63
GPT 4 | 0.54 | 0.68 | 15 | 0.72
Claude | 0.51 | 0.65 | 18 | 0.69
Bard | 0.46 | 0.58 | 20 | 0.62

As illustrated in Table 5, we find that GPT-4 outperforms the other models across all metrics, showcasing the highest BLEU (0.54), Rouge-1 (0.68), and Topic Coherence (0.72) scores, and the lowest Perplexity (15). Claude closely follows, with competitive scores in all metrics, demonstrating an effective balance in generating coherent and contextually appropriate emails. GPT 3.5T exhibits moderate performance, with BLEU and Topic Coherence scores lagging behind GPT-4 and Claude but outdoing Bard; its Rouge-1 score is only slightly behind Claude and GPT-4, indicating its competency in information retention. Bard presents slightly lower metrics than the rest but still showcases proficiency, unlike its performance in generating phishing websites, as seen earlier. In summary, all LLMs, despite exhibiting varying competencies, appear to be proficient in generating phishing emails.

6        Phishing Prompts Detection

Findings from the previous sections indicate that commercial LLMs can be utilized for generating phishing websites using malicious prompts. Thus, there is a need for the swift detection of these prompts to safeguard the integrity and security of these models. To address this issue, we propose a framework, as illustrated in Figure 12, for detecting phishing prompts with three different detection schemes. We examine the prompts individually, as an entire collection, and as subsets of prompts to accommodate real-time scenarios. For each detection scheme, we explain the groundtruth creation and model performance, along with the rationale behind transitioning across the different detection schemes.

6.1        Data Collection

As illustrated in Section 4.5, a series or collection of prompts can be automatically generated using these LLMs, which can result in code capable of creating a phishing website. Among the chatbots we investigated, we found that ChatGPT was the only LLM with an API that facilitates data collection for our purposes. Due to this limitation, we chose the OpenAI API [79] to proceed with data collection. Two models, GPT-3.5T [77] and GPT-4 [78], were used to generate these prompt collections.

Figure 12: Framework showing three Detection Schemes

We focused on generating prompt collections that incorporate all potential attacks, thus enhancing the model’s capability to efficiently detect prompts related to any attack type listed in Table 1. With the help of the prompt generation method outlined in Section 4.5, we generated 117 prompt collections using GPT-3.5 and 141 prompt collections using GPT-4. From the collections generated, we observed that the average number of prompts within each unique collection is approximately 9.27.

To have a balanced dataset regarding collections, we generated 258 benign prompt collections using the OpenAI API. We applied the same method mentioned in Section 4.5 to generate these collections using GPT-3.5 and GPT-4, with benign inputs.

6.2        Codebook Creation

To train our models using a groundtruth dataset, Coder 1 and Coder 2 utilized an open-coding technique. They manually labeled 2,392 prompts, sourced from GPT-3.5 and GPT-4, as either “Phishing” or “Benign.” Given the large size of the dataset, Coder 1 began by randomly selecting 40 prompts from each of the eight attack categories to discern the underlying themes crucial for developing a detailed codebook. The codebook classified elements as “Phishing” or “Benign,” contingent upon the inherent risk and intent related to phishing activities. Alongside each categorization, the codebook provides descriptions and examples for clarity. Notably, the codebook emphasized several techniques with a malicious inclination often associated with phishing. For instance, “Data Redirection” and “URL Randomization” were marked as “Phishing,” whereas legitimate web design elements like “Typography and Font” were labeled “Benign.”

Both coders utilized this codebook to label the entire dataset. Initially, Coder 1 identified 29 unique themes. The first pass on the dataset yielded a Cohen’s Kappa inter-rater reliability score of 0.71, signifying substantial agreement between the coders. As they resolved their disagreements, six additional themes were identified, expanding the codebook to 35 features. Disagreements between the coders were successfully addressed. We provide our codebook in Table 10 in the Appendix.
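The agreement statistic can be computed directly with scikit-learn. In the sketch below, the two label vectors are hypothetical stand-ins for the coders' annotations.

```python
from sklearn.metrics import cohen_kappa_score

# Invented annotations for five prompts, one vector per coder.
coder1 = ["Phishing", "Benign", "Phishing", "Phishing", "Benign"]
coder2 = ["Phishing", "Benign", "Benign",   "Phishing", "Benign"]

kappa = cohen_kappa_score(coder1, coder2)
print(f"Cohen's kappa = {kappa:.2f}")  # the first pass reported above scored 0.71
```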

6.3        Common Groundtruth Creation

To create a common groundtruth dataset for all the detection schemes, we first extracted the prompts from each prompt collection and stored them as individual prompts. Upon inspecting these prompts, we frequently observed the presence of extraneous elements such as bullet points, numerical values, and descriptors like step-1 or prompt-1. As these elements were irrelevant to the core content of the prompts, we removed them, preserving the fundamental sentences. We stored each prompt with attributes such as collection number and prompt number (to preserve the order of prompts) and version (to specify which GPT model generated the prompt). Leveraging the codebook, two independent coders manually assigned labels to each prompt across all the prompt collections. Each prompt was labeled either as malicious or benign. This process resulted in the labeling of 2,392 prompts in total, of which 1,255 were labeled as malicious and 1,137 as benign. Notably, not all prompts in a phishing prompt collection are malicious, and this data labeling helped us identify them.
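The exact clean-up rules are not published; a minimal sketch of the kind of normalization described above might look like the following, with illustrative regular expressions.

```python
import re

def clean_prompt(raw: str) -> str:
    """Strip list markers and step/prompt descriptors, keeping the sentence."""
    text = raw.strip()
    # Remove leading bullets or enumeration such as "-", "*", "3." or "(2)".
    text = re.sub(r"^\s*(?:[-*\u2022]|\(?\d+[.)])\s*", "", text)
    # Remove descriptors like "Step-1:" or "Prompt 2:" at the start.
    text = re.sub(r"^(?:step|prompt)[\s-]*\d+\s*[:.)-]\s*", "", text, flags=re.I)
    return text

print(clean_prompt("Prompt-3: Add a responsive navigation bar to the page."))
# -> "Add a responsive navigation bar to the page."
```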

We combined these prompt collections with additional benign prompt collections. In a similar fashion, we extracted benign prompts from the benign prompt collections and labeled all of them as benign. This resulted in 1,986 benign prompts across benign prompt collections.

6.3.1       Analysis of Annotated Prompts

We generated heatmaps to visualize human annotators’ evaluations of prompts generated by GPT-3.5T in Figure 13a and GPT-4 in Figure 13b. In each, the x-axis represents prompt numbers, indicating the prompt position within a collection, while the y-axis corresponds to the 8 different attacks listed in Table 1. The color gradient represents the average label the annotators assign to each prompt position in a collection: darker colors indicate that more prompts in a given position are labeled as malicious, and lighter colors indicate fewer.

Figure 13: Heatmaps showing distribution of malicious prompts in collections


These two heatmaps show the distribution of malicious prompts in our phishing collections. We observe that in some attack types generated by GPT-3.5 (Figure 13a), such as attacks 1–4, most of the prompts in collections are labeled as malicious, whereas in other attacks, including attacks 5–8, only a small portion of prompts in each collection are labeled as malicious, and they tend to appear close to the end of the collection. Interestingly, Figure 13b shows a more uniform distribution of malicious prompts in collections generated by GPT-4: in most attack types, the malicious prompts could appear in almost any position. We observe two exceptions, attacks 5 and 6, where many prompts and positions are labeled as benign.
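Figures of this kind can be produced with a simple pivot over the labeled prompts. The sketch below is illustrative: the variable names and randomly generated rows are hypothetical, not the paper's data.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Each row: one labeled prompt (attack type 1-8, position in collection, label).
df = pd.DataFrame({
    "attack":   rng.integers(1, 9, 500),
    "position": rng.integers(1, 11, 500),
    "label":    rng.integers(0, 2, 500),  # 1 = malicious, 0 = benign
})

# Cell (attack, position) = fraction of prompts at that position labeled malicious.
grid = df.pivot_table(index="attack", columns="position", values="label", aggfunc="mean")

plt.imshow(grid, cmap="Reds", aspect="auto")
plt.xlabel("Prompt position within collection")
plt.ylabel("Attack type (Table 1)")
plt.colorbar(label="Fraction labeled malicious")
plt.show()
```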

6.4        Individual Prompt Detection

The first step towards tackling the challenge of detection involves categorizing an individual prompt as either malicious or benign. To achieve this, we designed a binary classifier using pre-trained language models.

Groundtruth for Individual Prompt Detection: We selected both malicious and benign prompt collections, ensuring each individual prompt is labeled, from our common groundtruth dataset. Upon merging the two sets, we observed a total of 1,255 malicious prompts and 3,123 benign prompts. Recognizing this imbalance in the distribution of prompts, we opted to exclude some benign prompt collections. In total, we considered 50 benign prompt collections along with the 258 malicious prompt collections. We split this dataset into 70% for training, 20% for testing, and 10% for validation, maintaining the balance mentioned above.

Model Selection and Experiments: We acknowledge the effectiveness of traditional ML algorithms, such as Naïve Bayes [26] and SVM [68], in similar domains. However, these algorithms often demand large datasets with a substantial number of features to perform optimally. In our case, the constraints of limited data and a lack of extensive features steer us towards pre-trained language models for accomplishing this task.

Moreover, pre-trained language models like BERT [32], RoBERTa [67], etc., are trained on vast amounts of data, giving them a broad understanding of language, which is crucial in detecting nuanced and occasionally hidden malicious intent in prompts. Several families of pre-trained language models could be used, such as BERT-based and generative pre-trained models. Generative models are unidirectional and more suitable for tasks that involve generating text. BERT-based models, on the other hand, are bidirectional, allowing them to consider both left and right context when making predictions, which makes them more suitable for text classification tasks. Based on these advantages, we experiment with BERT-based models, including BERT [32], DistilBERT [90], RoBERTa [67], ELECTRA [23], DeBERTa [45], and XLNet [106]. Each model has its own advantages and disadvantages, which we consider along with their performance metrics. A brief description of the different models and their details related to size and parameters is provided in Section 8.3 in the Appendix.

Training Details: We used pre-trained versions of all the listed models from the Hugging Face Transformers library [104]. We fine-tuned these models on our ground truth dataset for 10 epochs with a batch size of 16, using the AdamW optimizer with a learning rate of 2e-5. The maximum sequence length was set to 512. We fine-tuned these models on a V100 GPU and used the last model checkpoint for evaluation. For obtaining embeddings for input sequences, we used the models’ respective tokenizers.
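A condensed sketch of this setup with the Hugging Face Trainer follows. The two-example dataset is a placeholder for the labeled ground truth; the hyperparameters mirror the ones stated above, and the output path is hypothetical.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

# Placeholder ground truth: individual prompts with 0 = benign, 1 = malicious.
train_ds = Dataset.from_dict({
    "text": ["Create a page styled after a bank login portal.",
             "Write a blog post about houseplants."],
    "label": [1, 0],
})

def tokenize(batch):
    # 512-token maximum sequence length, as in the training details above.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)

train_ds = train_ds.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="prompt-detector",
    num_train_epochs=10,             # 10 epochs
    per_device_train_batch_size=16,  # batch size 16
    learning_rate=2e-5,              # AdamW is the Trainer's default optimizer
)

Trainer(model=model, args=args, train_dataset=train_ds).train()
```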

Performance Evaluation: To select the best model, we scrutinize metrics such as average F1 score, Accuracy, Precision, and Recall. Furthermore, we compute the total time for predicting 100 samples and the median prediction time across those samples. Given our objective of deploying the model in real-time scenarios, where users submit prompts at high speed, these metrics are necessary for evaluation. Table 6 shows the performance of the models on our test set. We observe that RoBERTa shows slightly better performance, with an average F1 score of 0.95. Although lighter models such as DistilBERT and ELECTRA have slightly lower median prediction times than RoBERTa, their F1 scores are slightly lower, hovering around 0.93. Considering RoBERTa’s powerful training approach and best performance across all the models, we select RoBERTa as our final model for individual prompt detection.

Challenges with Individual prompt classification: There are several scenarios where individual prompt classification might not be sufficient. For example, an individual prompt might not provide complete information about a user’s intent. Adaptive attackers may engage in long conversations with the models to accomplish their task, and scenarios may arise where individual prompts look benign but the entire conversation leads to malicious outcomes. Depending solely on an individual prompt classifier in such cases might offer leeway for malicious users to elude detection. Such scenarios strongly demand a detection mechanism that goes beyond analyzing individual prompts. To achieve this, we perform classification on whole collections of prompts, using the classifier trained on individual prompts.

6.5        Phishing Collection Detection

Acknowledging the limitations of depending solely on individual prompt analysis, we additionally evaluate complete collections of prompts. By analyzing entire collections, our objective is to obtain a broader understanding of conversational context. The main objective of this classification scheme is to evaluate the model’s performance when confronted with an entire collection consisting of multiple sentences. To implement this, we employ the same RoBERTa classifier trained for individual prompt detection.

Groundtruth for Whole Collection Detection: To test the capability of our individual classifier for detecting malicious collections, we created a balanced test set with 50 malicious and 53 benign collections. The collection label has a value of 1 for malicious prompt collections and 0 for benign collections.

Performance Evaluation: To evaluate the individual prompt classifier’s performance on entire collections of prompts, we presented collections of prompts during testing. To obtain the collections, we concatenate all the prompts within a collection, ordered by their prompt numbers, to form a single paragraph. This consolidated input is then pre-processed and presented to the model to obtain a prediction. Table 7 shows the performance of the model. We observe that RoBERTa achieves 96.1% accuracy, with an average F1 score of 0.96. Thus, even though the model is trained on individual prompts, it can effectively detect malicious collections.
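Under the same assumptions as the training sketch above, classifying a whole collection reduces to concatenating its prompts in order and running the prompt-level classifier once. The model path and the example prompts below are hypothetical.

```python
from transformers import pipeline

# Load the fine-tuned prompt classifier (path from the training sketch).
classifier = pipeline("text-classification", model="prompt-detector")

collection = [
    (2, "Add username and password input fields with a submit button."),
    (1, "Create a page styled after a well-known bank's login portal."),
    (3, "Send the submitted form values to a remote server."),
]

# Concatenate prompts by prompt number into a single paragraph, then classify.
paragraph = " ".join(text for _, text in sorted(collection))
print(classifier(paragraph, truncation=True))
```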

Challenges with Whole Collection Classification: The results of evaluating the model’s performance on entire collections show that conversations provide a more nuanced understanding of user intent than evaluating individual prompts separately. This comprehensive approach increases the chance of detecting malicious activity. However, obtaining and testing an entire prompt collection is not feasible in real time, where users interact with chatbots one prompt at a time. To adapt to real-time scenarios, we propose examining the current prompt alongside its preceding prompts, to ascertain whether the captured context unveils any malicious intent.

6.6        Phishing Prompt Subset Detection – Real-Time Scenario

Building on insights gained from whole collection classification, and recognizing the practical constraint that prompts in real-time scenarios do not appear as a whole collection but one after the other, we evaluate the performance of the classifier on subsets of prompts. In this analysis, we aim to observe evolving user intent. This procedure entails analyzing a prompt alongside its preceding prompts within the same collection. Using this approach, our main objective is to understand sequences of interconnected prompts for the early detection of malicious activity in ongoing dialogues. To classify and evaluate such prompt subsets, we add new attributes to the dataset using our proposed heuristic.

Groundtruth for Prompt Subset classification: First, we took the common groundtruth dataset, with these attributes: Collection Number, Prompt Number, Prompt, Prompt Label, and Version. We incorporated a new attribute named “Prompt-Subset Label” by concatenating the current prompt with the previous set of prompts, using the prompt number in each collection to determine the correct order of concatenation. We store these concatenated prompts under a new attribute named “Prompts-Concatenated.” Utilizing the labels for individual prompts, the two coders labeled the prompt subsets as well, at each level. Subsequently, we checked the distribution of prompts again using the “Prompt-Subset Label”; based on this label, the numbers of malicious and benign prompt subsets are 1,958 and 2,420, respectively.

Performance Evaluation: Our goal is to assess the model’s performance across various prompt subsets. To accomplish this, we introduced different prompt subsets to the finetuned model during the testing phase and evaluated the model’s predictions against the “Prompt-Subset Label.” This process allowed us to analyze the model’s performance at each specific level of the prompt subsets. We used the same RoBERTa classifier employed for individual prompt detection and proceeded with testing prompt subsets. We also trained the same model directly on prompt subsets but obtained similar results. As training a classifier on individual prompts is less complex than training one on subsets of prompts, we chose the classifier trained on individual prompts.
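In real time, the same classifier can be applied to the growing prefix of the conversation after every new prompt. The sketch below assumes the pipeline from the previous sketches; the label name and threshold are hypothetical.

```python
def screen_conversation(prompts, classifier, threshold=0.5):
    """Classify each growing prompt prefix; stop at the first malicious one."""
    history = []
    for i, prompt in enumerate(prompts, start=1):
        history.append(prompt)
        verdict = classifier(" ".join(history), truncation=True)[0]
        # "LABEL_1" = malicious under the hypothetical label mapping assumed here.
        if verdict["label"] == "LABEL_1" and verdict["score"] >= threshold:
            return i   # position at which phishing intent was first detected
    return None        # conversation looks benign so far
```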

Table 8 shows the performance of the model on the test set containing subsets of prompts with their respective labels. From the results, we observe that RoBERTa achieves 98.0% accuracy, with an average F1 score of 0.98.

Table 6: Performance metrics for different models

Model | Accuracy | Precision | Recall | F1 Score | Total Time | Prediction Time – Median
BERT-base | 94.01% | 0.94 | 0.94 | 0.94 | 85.40s | 0.85s
DistilBERT | 93.15% | 0.93 | 0.93 | 0.93 | 42.86s | 0.43s
RoBERTa-base | 94.52% | 0.95 | 0.95 | 0.95 | 82.57s | 0.82s
DeBERTa | 93.66% | 0.94 | 0.94 | 0.94 | 141.34s | 1.41s
XLNET | 93.84% | 0.94 | 0.94 | 0.94 | 120.43s | 1.21s
ELECTRA | 92.64% | 0.93 | 0.93 | 0.93 | 16.73s | 0.17s

Table 7: Performance of Model on Whole Collections

Accuracy | Precision | Recall | F1 Score
96.2% | 0.96 | 0.96 | 0.96

This indicates that despite being trained on individual prompts, the model exhibits strong performance in identifying malicious intent within subsets of prompts.

Table 8: Performance of Model on Subsets of Prompts

Accuracy | Precision | Recall | F1 Score
98.0% | 0.98 | 0.98 | 0.98

After evaluating the outcomes of all three detection schemes, it is evident that the model effectively categorizes both entire collections and prompt subsets. However, due to the practical challenges of processing entire collections in real time, prompt subset detection emerges as the best choice for early and efficient real-time detection.

6.7        Detecting Phishing email prompts

To automatically detect phishing email generation prompts, we utilized the RoBERTa architecture and trained it on the 2,109 phishing prompts generated by GPT-4 from the eCrimeX phishing dataset and 2,109 benign email prompts generated in the same way from the Enron dataset, partitioning the dataset into a 70:30 train:test split. The performance of our model is illustrated in Table 9. The model achieved an accuracy of 94%, with precision standing at 95%. Overall, these metrics highlight the model’s robust capability for the early detection of prompts that attempt to generate phishing emails using LLMs.

Table 9: Performance of our email prompt detection model

Accuracy | Precision | Recall | F1 Score
94% | 95% | 93% | 94%

7        Conclusion

7.1        Implications

Our research indicates that readily available commercial LLMs can effectively generate phishing websites and emails. These LLMs are not only capable of being manually directed to initiate attacks but can also autonomously generate phishing prompts. These AI-created prompts are adept at producing phishing content that can evade current anti-phishing tools as effectively as human-generated content. Moreover, phishing emails derived from LLM-generated prompts can mimic authentic human phishing attempts with high accuracy. The potential misuse of LLMs for phishing poses a serious threat, as attackers can refine and reuse a small set of prompts to create a vast number of sophisticated phishing attacks. However, we have developed a machine learning model that can detect these malicious prompts early on, which is crucial in preempting the production of harmful content by LLMs. Our model, which demonstrates strong performance in identifying phishing prompts for both websites and emails, could be integrated with LLMs as a third-party plugin.
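One way such a plugin could look is a gate that vets each prompt (with its history) before forwarding it to the LLM. The sketch below loads the public demo model named in this paper; we assume the hosted repository exposes a standard sequence-classification head, and the label name checked here is an assumption.

```python
from transformers import pipeline

# Public demo model hosted by the authors (see Section 7.2).
detector = pipeline("text-classification", model="phishbot/Isitphish")

def guarded_llm_call(prompt, history, llm_fn):
    """Forward the prompt to the LLM only if no phishing intent is detected."""
    verdict = detector(" ".join(history + [prompt]), truncation=True)[0]
    if "phish" in verdict["label"].lower():  # label naming is an assumption
        return "Request blocked: possible phishing intent detected."
    return llm_fn(prompt)
```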

7.2        Ethics and Data Sharing

Since ChatGPT 3.5T and GPT-4 were used to generate the phishing prompts, we have disclosed these prompts to their developer, OpenAI [79], and plan to publicly release them after OpenAI’s mandatory 90-day vulnerability disclosure period [1]. We have also disclosed the vulnerabilities to the developers of Claude and Bard, i.e., Anthropic and Google, and are awaiting their feedback. Meanwhile, our model can be tested on Hugging Face at https://huggingface.co/phishbot/Isitphish, where users can try out different prompts to check whether they carry phishing intent toward creating malicious websites or emails. Our dataset and framework are also available upon request.

References

  • [4]    PhishTank. https://www.phishtank.com/faq.php, 2020.
  • [5]    VirusTotal. https://www.virustotal.com/gui/home/, 2020.
  • [6]    McAfee WebAdvisor. https://www.mcafee.com/en-us/safe-browser/mcafee-webadvisor.html, 2022.
  • [7]    ADOBE. Responsive web design. https://xd.adobe.com/ideas/ principles/web-design/responsive-web-design-2/, July 2021. [Accessed on 9 March 2023].
  • [8]    AKHAWE, D., AND FELT, A. P. Alice in warningland: A large-scale field study of browser security warning effectiveness. In 22nd USENIX Security Symposium (USENIX Security 13) (2013), pp. 257–272.
  • [9]    ALABDAN, R. Phishing attacks survey: Types, vectors, and technical approaches. Future internet 12, 10 (2020), 168.
  • [10]    ALEROUD, A., AND ZHOU, L. Phishing environments, techniques, and countermeasures: A survey. Computers & Security 68 (2017), 160–196.
  • [11]    ALKHALIL, Z., HEWAGE, C., NAWAF, L., AND KHAN, I. Phishing attacks: A recent comprehensive study and a new anatomy. Frontiers in Computer Science 3 (2021), 563060.
  • [12]    ALPER, K., AND COHEN, I. Opwnai: Cybercriminals starting to use gpt for impersonation and social engineering. Check Point Research (March 2023).
  • [13]    ANTHROPIC. Claude-intro, 2023.
  • [14]    APWG. eCrimeX. https://apwg.org/ecx/.
  • [15]    AUTH0. Preventing clickjacking attacks, June 2021.
  • [16]    BIJMANS, H., BOOIJ, T., SCHWEDERSKY, A., NEDGABAT, A., AND VAN WEGBERG, R. Catching phishers by their bait: Investigating the dutch phishing landscape through phishing kit detection. In 30th USENIX Security Symposium (USENIX Security 21) (2021), pp. 3757–3774.
  • [17]    BISWAS, S. Chatgpt and the future of medical writing, 2023.
  • [18]    BLOG, S. Dissecting a phishing campaign with a captcha-based url. Trustwave (March 2021).
  • [19]    BOOTSTRAP. Bootstrap. https://getbootstrap.com/, 2023. [Accessed on 9 March 2023].
  • [20]    BUTAVICIUS, M., TAIB, R., AND HAN, S. J.  Why people keep falling for phishing scams: The effects of time pressure and deception cues on the detection of phishing emails. Computers & Security 123 (2022), 102937.
  • [21]    CARAMANCION, K. M. Harnessing the power of chatgpt to decimate mis/disinformation: Using chatgpt for fake news detection. In 2023 IEEE World AI IoT Congress (AIIoT) (2023), IEEE, pp. 0042–0046.
  • [22]    CIDON, A., GAVISH, L., BLEIER, I., KORSHUN, N., SCHWEIGHAUSER, M., AND TSITKIN, A.   High precision detection of business email compromise. In 28th USENIX Security Symposium (USENIX Security 19) (Santa Clara, CA, Aug. 2019), USENIX Association, pp. 1291–1307.
  • [23]    CLARK, K., LUONG, M.-T., LE, Q. V., AND MANNING, C. D. Electra: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555 (2020).
  • [24]    COFENSE. Global polymorphic phishing attack 2022. https://bit.ly/3ZVtu4t, March 2022. [Accessed: March 9, 2023].
  • [25]    COHEN, L. Chatgpt hack allows chatbot to generate malware, June 2021.
  • [26]    DAI, W., XUE, G.-R., YANG, Q., AND YU, Y. Transferring naive bayes classifiers for text classification. In AAAI (2007), vol. 7, pp. 540–545.
  • [27]    DAS, M., PANDEY, S. K., AND MUKHERJEE, A. Evaluating chatgpt’s performance for multilingual and emoji-based hate speech detection. arXiv preprint arXiv:2305.13276 (2023).
  • [28]    DE ANGELIS, L., BAGLIVO, F., ARZILLI, G., PRIVITERA, G. P., FERRAGINA, P., TOZZI, A. E., AND RIZZO, C. Chatgpt and the rise of large language models: the new ai-driven infodemic threat in public health. Frontiers in Public Health 11 (2023), 1166120.
  • [29]    DEIANA, G., DETTORI, M., ARGHITTU, A., AZARA, A., GABUTTI, G., AND CASTIGLIA, P. Artificial intelligence and public health: Evaluating chatgpt responses to vaccination myths and misconceptions. Vaccines 11, 7 (2023), 1217.
  • [30]    DERNER, E., AND BATISTIČ, K. Beyond the safeguards: Exploring the security risks of chatgpt. arXiv preprint arXiv:2305.08005 (2023).
  • [31]    DEVELOPERS, G. recaptcha v3: Add the recaptcha script to your html or php file. https://developers.google.com/recaptcha/docs/display, September 2021. [Online; accessed 9-March-2023].
  • [32]    DEVLIN, J., CHANG, M.-W., LEE, K., AND TOUTANOVA, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
  • [33]    DOE, J. ChatGPT vs Microsoft Copilot: The major differences. UC Today (2023).
  • [34]    DOE, J. The phishing landscape 2023. Tech. rep., Interisle Consulting Group, 2023.
  • [35]    DOWNS, J. S., HOLBROOK, M., AND CRANOR, L. F. Behavioral response to phishing risk. In Proceedings of the anti-phishing working groups 2nd annual eCrime researchers summit (2007), pp. 37–44.
  • [36]    DUTTA, P. Perplexity of language models. Medium (2021).
  • [37]    ERKKILA, J. Why we fall for phishing. In Proceedings of the SIGCHI conference on Human Factors in Computing Systems CHI 2011 (2011), ACM, pp. 7–12.
  • [38]    FOUNDATION. Foundation. https://get.foundation/, 2023. [Accessed on 9 March 2023].
  • [39]    FOUSS, B., ROSS, D. M., WOLLABER, A. B., AND GOMEZ, S. R. Punyvis: A visual analytics approach for identifying homograph phishing attacks. In 2019 IEEE Symposium on Visualization for Cyber Security (VizSec) (2019), IEEE, pp. 1–10.
  • [40]    GOOGLE. Bard-google-ai, 2023.
  • [41]    GRESHAKE, K., ABDELNABI, S., MISHRA, S., ENDRES, C., HOLZ, T., AND FRITZ, M. Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection, 2023.
  • [42]    GUPTA, M., AKIRI, C., ARYAL, K., PARKER, E., AND PRAHARAJ, L. From chatgpt to threatgpt: Impact of generative ai in cybersecurity and privacy. IEEE Access (2023).
  • [43]    HAN, X., KHEIR, N., AND BALZAROTTI, D. Phisheye: Live monitoring of sandboxed phishing kits. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (2016), pp. 1402–1413.
  • [44]    HE, D., LV, X., ZHU, S., CHAN, S., AND CHOO, K.-K. R. A method for detecting phishing websites based on tiny-bert stacking. IEEE Internet of Things Journal (2023).
  • [45]    HE, P., LIU, X., GAO, J., AND CHEN, W. Deberta: Decoding-enhanced bert with disentangled attention. arXiv preprint arXiv:2006.03654 (2020).
  • [46]    HO, G., CIDON, A., GAVISH, L., SCHWEIGHAUSER, M., PAXSON, V., SAVAGE, S., VOELKER, G. M., AND WAGNER, D. Detecting and characterizing lateral phishing at scale. In 28th USENIX Security Symposium (USENIX Security 19) (Santa Clara, CA, Aug. 2019), USENIX Association, pp. 1273–1290.
  • [47]    HOFFMAN, C. It’s scary easy to use chatgpt to write phishing emails. CNET (October 2021).
  • [48]    JAIN, A. K., AND GUPTA, B. A survey of phishing attack techniques, defence mechanisms and open research challenges. Enterprise Information Systems 16, 4 (2022), 527–565.
  • [49]    JALIL, S., RAFI, S., LATOZA, T. D., MORAN, K., AND LAM, W. Chatgpt and software testing education: Promises & perils. arXiv preprint arXiv:2302.03287 (2023).
  • [50]    KAN, M. Fbi: Hackers are compromising legit qr codes to send you to phishing sites. PCMag (May 2022). [Online; accessed 9-March-2023].
  • [51]    KANG, D., LI, X., STOICA, I., GUESTRIN, C., ZAHARIA, M., AND HASHIMOTO, T. Exploiting programmatic behavior of llms: Dual-use through standard security attacks. arXiv preprint arXiv:2302.05733 (2023).
  • [52]    KANG, L., AND XIANG, J. Captcha phishing: A practical attack on human interaction proofing. In Proceedings of the 5th international conference on Information security and cryptology (2009), pp. 411–425.
  • [53]    KANG, L., AND XIANG, J. Captcha phishing: A practical attack on human interaction proofing. In Information Security and Cryptology: 5th International Conference, Inscrypt 2009, Beijing, China, December 12-15, 2009. Revised Selected Papers 5 (2010), Springer, pp. 411–425.
  • [54]    KARANJAI, R. Targeted phishing campaigns using large scale language models. arXiv preprint arXiv:2301.00665 (2022).
  • [55]    KARKI, B., ABRI, F., NAMIN, A. S., AND JONES, K. S. Using transformers for identification of persuasion principles in phishing emails. In 2022 IEEE International Conference on Big Data (Big Data) (2022), IEEE, pp. 2841–2848.
  • [56]    KLIMT, B., AND YANG, Y. The enron corpus: A new dataset for email classification research. In European conference on machine learning (2004), Springer, pp. 217–226.
  • [57]    KOVACS, E. Malicious prompt engineering with ChatGPT, September 2021.
  • [58]    LAI, F. The carbon footprint of GPT-4. Towards Data Science (2022).
  • [59]    LAM, I.-F., XIAO, W.-C., WANG, S.-C., AND CHEN, K.-T. Counteracting phishing page polymorphism: An image layout analysis approach. In Advances in Information Security and Assurance: Third International Conference and Workshops, ISA 2009, Seoul, Korea, June 25-27, 2009. Proceedings 3 (2009), Springer, pp. 270–279.
  • [60]    LI, H., GUO, D., FAN, W., XU, M., AND SONG, Y. Multi-step jailbreaking privacy attacks on chatgpt. arXiv preprint arXiv:2304.05197 (2023).
  • [61]    LIANG, B., SU, M., YOU, W., SHI, W., AND YANG, G. Cracking classifiers for evasion: A case study on the google’s phishing pages filter. In Proceedings of the 25th International Conference on World Wide Web (2016), pp. 345–356.
  • [62]    BITDEFENDER TRAFFICLIGHT. https://www.bitdefender.com/solutions/trafficlight.html.
  • [63]    LIN, C.-Y. ROUGE: A Package for Automatic Evaluation of Summaries, 2004.
  • [64]    LIU, J., XIA, C. S., WANG, Y., AND ZHANG, L. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. arXiv preprint arXiv:2305.01210 (2023).
  • [65]    LIU, R., LIN, Y., YANG, X., NG, S. H., DIVAKARAN, D. M., AND DONG, J. S. Inferring phishing intention via webpage appearance and dynamics: A deep vision based approach. In 31st USENIX Security Symposium (USENIX Security 22) (2022).
  • [66]    LIU, Y., DENG, G., LI, Y., WANG, K., ZHANG, T., LIU, Y., WANG, H., ZHENG, Y., AND LIU, Y. Prompt injection attack against llm-integrated applications. arXiv preprint arXiv:2306.05499 (2023).
  • [67]    LIU, Y., OTT, M., GOYAL, N., DU, J., JOSHI, M., CHEN, D., LEVY, O., LEWIS, M., ZETTLEMOYER, L., AND STOYANOV, V. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
  • [68]    LIU, Z., LV, X., LIU, K., AND SHI, S. Study on svm compared with the other text classification methods. In 2010 Second international workshop on education technology and computer science (2010), vol. 1, IEEE, pp. 219–222.
  • [69]    MINITOOL. ChatGPT: This content may violate our content policy, February 2022.
  • [70]    MORGAN, M. Qr code phishing scams target users and enterprise organizations. Security Magazine (October 2021). [Online; accessed 9-March-2023].
  • [71]    MRD0X. Browser in the Browser: Phishing Attack. https://mrd0x.com/browser-in-the-browser-phishing-attack/, January 2022. [Accessed on April 28, 2023].
  • [72]    ODEH, A., KESHTA, I., AND ABDELFATTAH, E. Machine learning techniques for detection of website phishing: A review for promises and challenges. In 2021 IEEE 11th Annual Computing and Communication Workshop and Conference (CCWC) (2021), IEEE, pp. 0813–0818.
  • [73]    OEST, A., SAFAEI, Y., ZHANG, P., WARDMAN, B., TYERS, K., SHOSHITAISHVILI, Y., AND DOUPÉ, A. Phishtime: Continuous longitudinal measurement of the effectiveness of anti-phishing blacklists. In 29th USENIX Security Symposium (USENIX Security 20) (2020), pp. 379–396.
  • [74]    OEST, A., SAFEI, Y., DOUPÉ, A., AHN, G.-J., WARDMAN, B., AND WARNER, G. Inside a phisher’s mind: Understanding the anti-phishing ecosystem through phishing kit analysis. In 2018 APWG Symposium on Electronic Crime Research (eCrime) (2018), IEEE, pp. 1–12.
  • [75]    OEST, A., ZHANG, P., WARDMAN, B., NUNES, E., BURGIS, J., ZAND, A., THOMAS, K., DOUPÉ, A., AND AHN, G.-J. Sunrise to sunset: Analyzing the end-to-end life cycle and effectiveness of phishing attacks at scale. In 29th USENIX Security Symposium (USENIX Security 20) (2020).
  • [76]    OPENAI. Openai usage policies, 2021.
  • [77]    OPENAI. Openai gpt-3.5 models, 2022.
  • [78]    OPENAI. Gpt-4 technical report, 2023.
  • [79]    OPENAI. Openai api, 2023.
  • [80]    OPENPHISH. Phishing feed. https://openphish.com/faq.html.
  • [81]    OSWALD, C., SIMON, S. E., AND BHATTACHARYA, A. Spotspam: Intention analysis–driven sms spam detection using bert embeddings. ACM Transactions on the Web (TWEB) 16, 3 (2022), 1–27.
  • [82]    OTIENO, D. O., NAMIN, A. S., AND JONES, K. S. The application of the bert transformer model for phishing email classification. In 2023 IEEE 47th Annual Computers, Software, and Applications Conference (COMPSAC) (2023), IEEE, pp. 1303–1310.
  • [83]    PALO ALTO NETWORKS UNIT 42. Captcha-protected phishing: What you need to know. https://unit42.paloaltonetworks.com/captcha-protected-phishing/, June 2021. [Accessed: March 9, 2023].
  • [84]    PAPINENI, K., ROUKOS, S., WARD, T., AND ZHU, W.-J. Bleu: a method for automatic evaluation of machine translation, 2002.
  • [85]    PORTSWIGGER. Same-origin policy. https://portswigger.net/web-security/cors/same-origin-policy, 2023. [Online; accessed 9-March-2023].
  • [86]    QADIR, J. Engineering education in the era of chatgpt: Promise and pitfalls of generative ai for education.
  • [87]    QRCODE MONKEY. QR Server. https://www.qrserver.com/, Accessed on March 8, 2023.
  • [88]    RIFAT, N., AHSAN, M., CHOWDHURY, M., AND GOMES, R. Bert against social engineering attack: Phishing text detection. In 2022 IEEE International Conference on Electro Information Technology (eIT) (2022), IEEE, pp. 1–6.
  • [89]    ROSNER, F., HINNEBURG, A., RÖDER, M., NETTLING, M., AND BOTH, A. Evaluating topic coherence measures. arXiv preprint arXiv:1403.6397 (2014).
  • [90]    SANH, V., DEBUT, L., CHAUMOND, J., AND WOLF, T. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108 (2019).
  • [91]    SHEN, X., CHEN, Z., BACKES, M., SHEN, Y., AND ZHANG, Y. “Do anything now”: Characterizing and evaluating in-the-wild jailbreak prompts on large language models. arXiv preprint arXiv:2308.03825 (2023).
  • [92]    SHKATOV, M. Chatting our way into creating a polymorphic malware, January 2018.
  • [93]    SOFTWARE, C. What is phishing?
  • [94]    SOUTHERN, M. Chatgpt examples: 5 ways businesses are using openai’s language model, 2021.
  • [95]    TEAM, P. T. I. Have a money latte? then you too can buy a phish kit, 2023.
  • [96]    TEAM, S. iframe injection attacks and mitigation. SecNHack (February 2022). [Online; accessed 9-March-2023].
  • [97]    TOUVRON, H., LAVRIL, T., IZACARD, G., MARTINET, X., LACHAUX, M.-A., LACROIX, T., ROZIÈRE, B., GOYAL, N., HAMBRO, E., AZHAR, F., ET AL. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023).
  • [98]    TUCKER, T. A consumer-protection agency warns that scammers are using ai to make their schemes more convincing and dangerous. Business Insider (March 2023).
  • [99]    TUFANO, M., CHANDEL, S., AGARWAL, A., SUNDARESAN, N., AND CLEMENT, C.  Predicting code coverage without execution. arXiv preprint arXiv:2307.13383 (2023).
  • [100]    VARSHNEY, G., MISRA, M., AND ATREY, P. K. A survey and classification of web phishing detection schemes. Security and Communication Networks 9, 18 (2016), 6266–6284.
  • [101]    VENTURES, C. Beware of lookalike domains in punycode phishing attacks. Cybersecurity Ventures (2019).
  • [102]    VIDAS, T., OWUSU, E., WANG, S., ZENG, C., CRANOR, L. F., AND CHRISTIN, N. Qrishing: The susceptibility of smartphone users to qr code phishing attacks. In Financial Cryptography and Data Security: FC 2013 Workshops, USEC and WAHC 2013, Okinawa, Japan, April 1, 2013, Revised Selected Papers 17 (2013), Springer, pp. 52–69.
  • [103]    WANG, Y., ZHU, W., XU, H., QIN, Z., REN, K., AND MA, W. A large-scale pretrained deep model for phishing url detection. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2023), IEEE, pp. 1–5.
  • [104]    WOLF, T., DEBUT, L., SANH, V., CHAUMOND, J., DELANGUE, C., MOI, A., CISTAC, P., RAULT, T., LOUF, R., FUNTOWICZ, M., ET AL. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations (2020), pp. 38–45.
  • [105]    WU, C., ZHANG, X., ZHANG, Y., WANG, Y., AND XIE, W. Pmc-llama: Further finetuning llama on medical papers. arXiv preprint arXiv:2304.14454 (2023).
  • [106]    YANG, Z., DAI, Z., YANG, Y., CARBONELL, J., SALAKHUTDINOV, R. R., AND LE, Q. V. Xlnet: Generalized autoregressive pretraining for language understanding. Advances in neural information processing systems 32 (2019).
  • [107]    YUNXIANG, L., ZIHAN, L., KAI, Z., RUILONG, D., AND YOU, Z. Chatdoctor: A medical chat model fine-tuned on llama model using medical domain knowledge. arXiv preprint arXiv:2303.14070 (2023).
  • [108]    ZHANG, P., OEST, A., CHO, H., SUN, Z., JOHNSON, R., WARDMAN, B., SARKER, S., KAPRAVELOS, A., BAO, T., WANG, R., ET AL. Crawlphish: Large-scale analysis of client-side cloaking techniques in phishing. In 2021 IEEE Symposium on Security and Privacy (SP) (2021), IEEE, pp. 1109–1124.
  • [109]    ZHONG, L., AND WANG, Z. A study on robustness and reliability of large language model code generation. arXiv preprint arXiv:2308.10335 (2023).

8        Appendix

The appendix includes materials that are complementary to the main content of the paper.

8.1        Image interpretation prompt

As of October 4, 2023, users can include images in GPT-4 prompts to generate the desired content. We discovered that providing images of login forms of legitimate brands can prompt GPT-4 to emulate these designs, which can result in the generation of phishing attacks. The format of these prompts is similar to our phishing prompt generation, as illustrated in Figure 3, except that there is no need to include prompts related to website design emulation, since the design is inferred directly from the provided screenshot. An instance of such a potential attack, using a login form screenshot as a trigger, is depicted in Figure 14. For proactive detection of potentially malicious prompts, the prompt detection model presented in Section 6 offers a solution for textual content. Concurrently, numerous machine learning models exist that can determine the possible intent behind a phishing website from cues like logos or the presence and positioning of login fields [?, 65]. By integrating our detection model with any of these models, we can enhance early detection capabilities for image-based prompts.
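As a hypothetical illustration of such an integration, the fusion rule and both model interfaces in the sketch below are assumptions; any visual model that scores phishing intention from a screenshot (e.g., via logo or login-field cues) could fill the vision_model slot.

```python
def is_malicious_multimodal(prompt_text, screenshot, text_model, vision_model,
                            threshold=0.5):
    """Flag an image-bearing prompt if either modality looks phishing-like.

    text_model and vision_model are assumed to expose a predict_proba()
    method returning the probability of phishing intent (an illustrative API).
    """
    p_text = text_model.predict_proba(prompt_text)    # textual prompt detector
    p_image = vision_model.predict_proba(screenshot)  # visual phishing-intention cues
    return max(p_text, p_image) >= threshold          # simple late-fusion rule
```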

Figure 14: Phishing website generated using image-based prompts in GPT 4.

8.2        Distribution of attack vectors in phishing email prompts

In Section 5, we utilized 2,109 phishing emails from APWG eCrimeX [14] to generate prompts using GPT-4, which were then provided back to all the LLMs to generate new phishing email attacks. Figure 15 provides a closer look at the attack vectors present in these prompts.

8.3        Pre-trained models evaluated for detection

In this section, we provide an overview of the six deep learning models that were evaluated for building our phishing prompt detection tool. BERT [32] set the precedent with its architecture of 12 layers, 768 hidden units, 12 attention heads, and 110 million parameters. BERT was pre-trained on a vast corpus and distinguishes itself by deeply understanding bidirectional context through conditioning on both sides of a given token; masked language modeling, which masks 15% of the words in the input, and next-sentence prediction are central to its training. RoBERTa [67] shares BERT’s structural design but expands its parameter count to 125 million and tweaks key hyperparameters. It notably departs from BERT by removing the next-sentence prediction objective, opting instead for training with larger mini-batches and learning rates, a change that has yielded improvements in performance. DistilBERT [90] offers an alternative by halving the number of layers to six while maintaining the same hidden-unit size and attention-head count, resulting in a model with only 66 million parameters. Despite this reduction, DistilBERT retains 97% of BERT’s performance thanks to knowledge distillation, in which the model is trained to approximate BERT’s behavior. ELECTRA [23] takes a different approach with its 12-layer architecture, reducing the hidden units to 256 and the attention heads to 4, for an overall parameter count of 14 million. This model functions as a discriminator in a setup where a generator introduces fake tokens, with ELECTRA focusing on the replaced-token detection task, which yields a more efficient training process than its predecessors. DeBERTa [45] mirrors BERT and RoBERTa in layer and head count, also featuring 768 hidden units, but pushes the parameter count to 125 million. It innovates with disentangled attention mechanisms and a novel positional encoding scheme; using relative position encodings, DeBERTa has demonstrated robust performance across various GLUE tasks. Lastly, XLNet [106] aligns with BERT in its 12 layers, 768 hidden units, and 12 attention heads, encompassing 110 million parameters. XLNet transcends BERT’s capabilities by employing a permutation-based training method that captures bidirectional context, addressing some of BERT’s inherent limitations.
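For reference, all six encoders can be instantiated through the Hugging Face transformers Auto classes, as in the sketch below; the checkpoint names are the public base models, which we assume correspond to the variants evaluated here (ELECTRA’s 14-million-parameter configuration matches the small discriminator).

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Public base checkpoints assumed to match the evaluated variants
CHECKPOINTS = {
    "BERT-base": "bert-base-uncased",
    "DistilBERT": "distilbert-base-uncased",
    "RoBERTa-base": "roberta-base",
    "DeBERTa": "microsoft/deberta-base",
    "XLNet": "xlnet-base-cased",
    "ELECTRA": "google/electra-small-discriminator",
}

models = {}
for name, ckpt in CHECKPOINTS.items():
    tok = AutoTokenizer.from_pretrained(ckpt)
    clf = AutoModelForSequenceClassification.from_pretrained(ckpt, num_labels=2)
    models[name] = (tok, clf)  # each pair is then fine-tuned on the same prompt data
```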

Figure 15: Distribution of phishing email attack vectors in the prompt dataset

8.4        Metrics used for evaluating LLM generated phishing emails

In this section, we describe the metrics used for evaluating the robustness of phishing emails generated by LLMs, comparing each with its corresponding human-constructed phishing email. BLEU [84] evaluates the LLM-generated text by aligning its n-grams with those of the reference text, employing a geometric mean to integrate precision scores across different n-gram lengths, adjusted by a brevity penalty to counteract short, uninformative output. It thus measures the model’s ability to create contextually relevant and semantically accurate emails, with a higher BLEU score denoting an email that closely mirrors the human reference and therefore holds stronger potential to deceive victims. ROUGE [63], in contrast, is recall-oriented, calculating the proportion of n-grams shared with the reference, and offers variants such as ROUGE-L, which hinges on the longest common subsequence. This metric signals the model’s capability to retain and replicate essential content from the reference texts, which is central to composing phishing emails that are both informative and convincing; a higher ROUGE score implies that the phishing email retains more of the important information, potentially increasing its effectiveness. Topic Coherence [89] contrasts word distribution probabilities across various text segments, using analytical tools such as LSA to ensure the narrative does not deviate from the central theme. A higher score implies a semantically well-connected text, which is crucial for maintaining thematic consistency, an aspect that improves the credibility of a phishing email and increases the chance it is mistaken for a legitimate message. Finally, Perplexity [36] uses GPT-2 to quantify a model’s predictive ability by inverting and normalizing the likelihood of the test corpus under the model; lower perplexity indicates an enhanced ability to produce sequences that are natural and expected under the language model’s learned parameters, and it also reflects the quality of the LLMs in generating contextually appropriate emails.
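The sketch below shows how three of these metrics (BLEU, ROUGE, and perplexity) can be computed with common open-source libraries (nltk, rouge-score, transformers); the example strings and the GPT-2 perplexity recipe are illustrative assumptions, not our exact evaluation pipeline. Topic coherence typically requires a corpus-level pipeline and is omitted here.

```python
import math
import torch
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu
from rouge_score import rouge_scorer
from transformers import GPT2LMHeadModel, GPT2Tokenizer

reference = "Your account has been suspended. Verify your details now."  # human email
candidate = "Your account was suspended; please verify your details immediately."

# BLEU: n-gram precision of the LLM email against the human reference
bleu = sentence_bleu([reference.split()], candidate.split(),
                     smoothing_function=SmoothingFunction().method1)

# ROUGE-1 / ROUGE-L: recall-oriented unigram and longest-common-subsequence overlap
rouge = rouge_scorer.RougeScorer(["rouge1", "rougeL"]).score(reference, candidate)

# Perplexity under GPT-2: lower values indicate more natural-sounding text
tok = GPT2Tokenizer.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2")
ids = tok(candidate, return_tensors="pt").input_ids
with torch.no_grad():
    loss = lm(ids, labels=ids).loss
perplexity = math.exp(loss.item())
```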

Table 10: Codebook utilized to label LLM generated prompts – Part 1

Categories | Description | Example Elements | Label
Document structure | Basic HTML document structure including DOCTYPE, html, head, and body tags | DOCTYPE, html, head, body | Not Phishing
Meta Tags | Inclusion of metadata tags like charset, viewport, and description | charset, viewport, description | Not Phishing
Typography and Font | Choices related to text styling, fonts, and typography | Arial, Helvetica, Font size | Not Phishing
Form Design and Positioning | Styling and placement of forms, text boxes, and buttons | Form size, button design | Not Phishing
Non-sensitive UI Elements | Standard, non-sensitive user interface elements like image tags, div elements | Amazon logo, div elements | Not Phishing
Server-side Script Basics | Basic structure of server-side scripts | PHP structure, if-else checks | Not Phishing
Hosting and Domain | Information related to hosting, setting up domain | HTTPS, mydomain.com | Not Phishing
Sensitive Data Collection | Explicit mention of collecting sensitive information like email and password | Collecting email and password | Phishing
Data Redirection | Redirecting user’s data to an external or email address | Sending email to youremail@email.com | Phishing
Imitating another site’s content | Explicit instructions to copy or imitate text or layout of another site | Texts from Amazon’s login page | Phishing
Misleading User Navigation | Redirecting or linking users in a way that might deceive them | Redirect to Amazon.com after email | Phishing
Design Mimicking Specific Websites | Detailed design elements that mimic the visual appearance of specific, established websites like Amazon | Using Amazon’s font, buttons, text boxes, form size, and Amazon logo, “create your amazon account” button design | Phishing
Fetching Text from Specific Websites | Copying text from established websites to create a similar appearance | Fetching text for email address placeholder, password placeholder, buttons, and additional information from Amazon’s login page | Phishing
Form Action Pointing to Custom PHP Script | Updating the form’s action to point to a custom PHP script | Modifying the form element so that the ’action’ attribute points to ’send_email.php’, ’submit’ type for the submit button | Not Phishing
Email and Redirect Functionality | Sending an email upon form submission and redirecting the user | PHP code in ’send_email.php’ to send an email to a specified address using submitted data, using ’header’ function to redirect to Amazon’s webpage | Phishing
URL Randomization | Randomly modifying the website URL each time the page is loaded | Using JavaScript to generate a random string and appending it to the URL | Phishing
File Naming | Naming of HTML or other code files to imitate another website’s naming conventions | amazon.com-signin.html | Phishing
Incorporating 3rd Party Brand Aesthetics | Using inline CSS or other styling methods to make a webpage appear like it belongs to a third party, e.g., Amazon | Change font, form size, buttons, and text boxes to resemble Amazon’s aesthetics | Phishing
Branded Imagery | Use of logos or other branded material from an existing company | <img src=”/amazon-logo.png” alt=”Amazon Logo”> | Not Phishing
Content Copying | Explicit instruction or action to incorporate texts from another site | Texts such as “Email,” “Password,” and “Create your Amazon account” from Amazon’s login page | Phishing
Misleading Call-to-Action | Use of misleading Call-to-Action buttons that look like they are part of another website’s UI | “Create your Amazon account” button that links to the Amazon account creation page | Phishing
Form Action Configuration | The target or action to which form data is sent | Set form action to send_email.php | Not Phishing
Data Collection and Redirection | Using server-side scripts to collect and send data to an external email or URL | Sending email to youremail@email.com upon form submission, then redirecting users to http://amazon.com/ | Phishing
Non-Standard Input Methods | Use of unconventional methods for form inputs due to constraints | content-editable spans or divs instead of form elements or input types | Not Phishing
Link Inclusion | Inclusion of multiple links within a webpage | At least three links to HTTPS webpages within your login page | Not Phishing
Security Warnings | Acknowledgment of limitations or compromises in security or functionality due to design constraints | Warning about usability and security issues due to constraints | Not Phishing

Table 11: Codebook utilized to label LLM generated prompts – Part 2

Categories | Description | Example Elements | Label
Webpage Title and Meta Information | Use of a title and meta tags in the HTML head section | Title for the webpage, necessary meta tags for device rendering | Not Phishing
Visual Content | Use of images and their styling attributes to enhance webpage visual appeal | Image of iPhone, use of width and height attributes | Not Phishing
CSS Animations | Use of CSS keyframes and animation properties to create visual effects | Implementing animations using CSS keyframes | Not Phishing
Interactive UI Elements | Design and inclusion of buttons or other interactive elements | Claim button with intuitive design | Not Phishing
Use of iframes | Embedding content or functionality within iframes | Enclosing claim button within an iframe | Phishing
JavaScript Redirection | Use of JavaScript to manipulate the browser’s location property for redirection | Modifying window.location.href to redirect to Amazon login page or specified URL | Phishing
Multi-part User Interactions | Creation of multi-part interactive elements that produce different outcomes based on how they are interacted with | Left or right part of the button leading to different URLs | Phishing
Test Before Deployment | Mention of testing the webpage to ensure proper look and functionality | Testing your page thoroughly before going live | Not Phishing