
A Survey of GPT-3 Family Large Language Models Including ChatGPT and GPT-4

October 4, 2023

Katikapalli Subramanyam Kalyan, Akmmus AI, Trichy, India

Email: kalyan@akmmusai.pro, Website: https://www.akmmusai.pro

Abstract—Large language models (LLMs) are a special class of pretrained language models obtained by scaling model size, pretraining corpus and computation. Because of their large size and pretraining on large volumes of text data, LLMs exhibit special abilities which allow them to achieve remarkable performance, without any task-specific training, in many natural language processing tasks. The era of LLMs started with OpenAI’s GPT-3 model, and the popularity of LLMs has been increasing exponentially since the introduction of models like ChatGPT and GPT-4. We refer to GPT-3 and its successor OpenAI models, including ChatGPT and GPT-4, as GPT-3 family large language models (GLLMs). With the ever-rising popularity of GLLMs, especially in the research community, there is a strong need for a comprehensive survey which summarizes the recent research progress in multiple dimensions and can guide the research community with insightful future research directions. We start the survey paper with foundation concepts like transformers, transfer learning, self-supervised learning, pretrained language models and large language models. We then present a brief overview of GLLMs and discuss their performance in various downstream tasks, specific domains and multiple languages. We also discuss the data labelling and data augmentation abilities of GLLMs, the robustness of GLLMs, the effectiveness of GLLMs as evaluators, and finally, conclude with multiple insightful future research directions. To summarize, this comprehensive survey paper will serve as a good resource for both academic and industry people to stay updated with the latest research related to GPT-3 family large language models.

Index Terms—Large Language Models, GPT-3, ChatGPT, GPT-4, Transformers, Survey.


1         INTRODUCTION

Large Language Models (LLMs), the recent buzz in Artificial Intelligence, have garnered a lot of attention in both academic and industry circles with their remarkable performance in most natural language processing (NLP) tasks. These models are essentially deep learning models, specifically transformer-based, pretrained on large volumes of text data and then aligned to human preferences using meta-training. Pretraining provides universal language knowledge to the model [1], while meta-training aligns the model to act based on the user’s intentions. Here, the user’s intention includes both explicit intentions, like following instructions, and implicit intentions, like maintaining truthfulness and avoiding bias, toxicity, or any harmful behaviour [2]. Large language models (LLMs) are a special class of pretrained language models obtained by scaling model size, pretraining corpus and computation. For downstream task usage, pretrained language models leverage the supervised learning paradigm, which involves task-specific fine-tuning and hundreds or thousands of labelled instances [1], [3]. LLMs leverage in-context learning (ICL), a new learning paradigm which doesn’t require task-specific fine-tuning and a large number of labelled instances [4]. LLMs treat any NLP task as a conditional text generation problem and generate the desired text output just by conditioning on the input prompt, which includes a task description, the test input and, optionally, a few examples. Figure 1 shows the evolution of artificial intelligence from machine learning to large language models.
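To make the in-context learning setup concrete, the following is a minimal Python sketch of how such a prompt might be assembled from a task description, a few demonstrations and a test input; the task, wording and examples are illustrative placeholders and are not taken from any specific GLLM paper.

```python
# Minimal sketch: assembling a few-shot in-context learning (ICL) prompt.
# The task, examples, and wording below are illustrative placeholders.

def build_icl_prompt(task_description, examples, test_input):
    """Concatenate the task description, labelled demonstrations, and the
    test input into a single prompt for conditional text generation."""
    lines = [task_description, ""]
    for source, target in examples:
        lines += [f"English: {source}", f"French: {target}", ""]
    lines += [f"English: {test_input}", "French:"]
    return "\n".join(lines)

prompt = build_icl_prompt(
    task_description="Translate English to French.",
    examples=[("Good morning.", "Bonjour."),
              ("Thank you very much.", "Merci beaucoup.")],
    test_input="See you tomorrow.",
)
print(prompt)  # The LLM is expected to continue the text after "French:".
```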

In the beginning, NLP systems were predominantly rule-based. These rule-based systems were built on top of rules framed by domain experts. As manual rule framing is a laborious, expensive process that also requires frequent updates, rule-based models were gradually replaced by machine learning models, which learn the rules automatically from the training data and completely avoid manual rule framing [1]. However, machine learning models require human intervention in the form of domain experts for feature engineering. With the evolution of dense text vector representation models like Word2Vec [5], GloVe [6] and FastText [7] and advances in computer hardware like GPUs, NLP systems were built using traditional deep learning models like CNN [8], RNN [9], LSTM [10], GRU [11], Seq2Seq [12] and attention-based Seq2Seq models [13], [14]. However, the drawbacks of these models, like the inability to (i) capture long-term dependencies and (ii) fully leverage GPUs because of sequential processing (except in the case of CNN), resulted in the evolution of advanced deep learning models like the transformer [15], which is fully attention-based without any recurrent or convolution layers.

Inspired by the success of image-pretrained models [16]–[18] built on top of transfer learning and large convolution models, the research community focused on building pretrained language models (PLMs) like BERT [19] and GPT-1 [20] with transformers as the backbone, pretrained using a new learning paradigm called self-supervised learning [1], [21], [22]. Unlike traditional deep learning models and vanilla transformers, which require training from scratch for downstream usage, pretrained language models can be easily adapted to downstream tasks with fine-tuning. The huge success of the BERT and GPT-1 models triggered the development of other pretrained language models like RoBERTa [71], XLNet [23], ELECTRA [24], ALBERT [25], DeBERTa [26], [27], GPT-2 [28], T5 [29], BART [30], etc.

Although PLMs have many advantages compared to traditional deep learning and vanilla transformer models, they still suffer from drawbacks like the inability to generalize to unseen tasks without task-specific training. So, the research community focused on developing more advanced models like large language models, which can generalize to unseen tasks without any task-specific training. The era of LLMs started with GPT-3 [4].

Fig. 1: Evolution of artificial intelligence from machine learning to large language models.

The success of GPT-3 inspired the development of other LLMs like PaLM [31], Chinchilla [32], GLaM [33], LaMDA [34], Gopher [35], Megatron–Turing NLG [36], [181], BLOOM [37], Galactica [38], OPT [39], LLaMA [40], [41], etc. The popularity of LLMs is increasing exponentially after the recent launch of OpenAI’s models like ChatGPT and GPT-4 [42]. For example, ChatGPT garnered millions of users within a few weeks of its launch. Because of their ability to generalize to unseen tasks based on a task description and a few examples, without requiring any task-specific training, just like humans, LLMs can be considered a baby step towards Artificial General Intelligence [43]. In this survey paper, we mainly focus on OpenAI LLMs like the GPT-3 models, GPT-3.5 models (InstructGPT, ChatGPT, etc.) and GPT-4, which we refer to as GPT-3 family large language models (GLLMs). This survey paper provides a comprehensive review of research works related to GLLMs in multiple dimensions.

Contributions. The key contributions of this survey paper are

  • First survey paper to present a comprehensive review of GPT-3 family large language models (GLLMs) in multiple dimensions covering more than 350 recent research papers.
  • We discuss various foundation concepts like transformers, transfer learning, self-supervised learning, pretrained language models and large language models.
  • We discuss GPT-3 family large language models in detail, starting from GPT-3 to the latest ChatGPT and GPT-4.
  • We discuss the performances of GLLMs in various downstream tasks and present a thorough discussion on the data labelling and data augmentation abilities of GLLMs.
  • We discuss the robustness and the evaluation abilities of GLLMs.
  • We present multiple insightful future research directions which will guide the research community to improve the performances of GLLMs further.

Comparison with existing surveys. The existing survey papers provide a review of large language models [44] and relevant concepts like in-context learning [45], evaluation [46], [47], alignment with human values [48], [49], safety and trustworthiness [50], reasoning [51], challenges and applications [52], LLM compression [53] and multi-modal LLMs [54]. For example, Zhao et al. [44] are the first to provide a comprehensive review of large language models. Unlike Zhao et al. [44], the other existing survey papers focus on specific aspects of LLMs. For example, the survey papers written by Dong et al. [45], Chang et al. [46], Wang et al. [48] and Huang et al. [51] focus on in-context learning, evaluation of LLMs, alignment of LLMs with human values and the reasoning ability of LLMs, respectively. Similarly, the survey papers written by Yin et al. [54] and Huan et al. [50] review multi-modal LLMs and the safety and trustworthiness of LLMs, respectively. However, no existing survey paper provides a comprehensive survey of GPT-3 family large language models. With the ever-rising popularity of GPT-3 family large language models like GPT-3, InstructGPT, ChatGPT and GPT-4, and the large number of research works using these models, there is a strong need for a survey paper which focuses exclusively on GPT-3 family large language models.

Papers collection. For this survey paper, we gathered over 350 research papers that appeared online between June 2020 and September 2023. Initially, we selected the GLLM papers, i.e., GPT-3, InstructGPT, Codex and GPT-4, as seed papers and collected all the citing papers. We also collected papers from popular venues like ACL, EMNLP, COLING, AAAI, ICML, ICLR, NeurIPS, etc., and popular databases like Google Scholar and ScienceDirect using the keywords GPT-3, ChatGPT, GPT-3.5, InstructGPT, Codex and GPT-4. After removing duplicate papers, we performed a manual review to arrive at a final set of over 350 relevant research papers.

Survey paper organization. The survey paper is organized as follows: Section 2 presents a brief overview of various foundation concepts like transformers, transfer learning, self-supervised learning, pretrained language models and large language models. Section 3 presents GPT-3 family large language models in detail, starting from GPT-3 to the latest ChatGPT and GPT-4. Sections 4, 5, and 6 discuss the performances of GLLMs in various downstream tasks, specific domains and multilingual scenarios, respectively. Section 7 presents the data labelling and data augmentation abilities of GLLMs. Section 8 discusses various research works presenting approaches to detect text generated by GLLMs. Sections 9 and 10 discuss the robustness and evaluation abilities of GLLMs, respectively. Section 11 presents multiple insightful future research directions.

2         FOUNDATION CONCEPTS

2.1       Transformer

2.1.1       Traditional Deep Learning Models

Before the evolution of the transformer model, most of the research in natural language processing involved deep learning models like the multi-layer perceptron (MLP), convolutional neural network (CNN), recurrent neural network (RNN), long short-term memory (LSTM) network, gated recurrent unit (GRU), sequence-to-sequence and attention-based sequence-to-sequence models [55]. MLP is a feed-forward neural network with three or more layers (an input layer, one or more hidden layers, and an output layer), and the neurons in these layers are fully connected. MLPs are easy to understand and simple to implement. However, as MLPs ignore sequence information and struggle to capture semantic relationships, these models were subsequently replaced by advanced models like CNN and RNN. CNN, originally developed to process images, has also been explored for natural language processing tasks by treating text as a one-dimensional image [8], [56]. CNNs can learn local features (n-grams) effectively using convolution layers but struggle to capture long-term dependencies. RNNs evolved as deep learning models designed specifically to process sequential data like text, time series, etc. [9]. RNNs can handle inputs of varying lengths and process sequential data by maintaining a hidden state that captures the context from previous inputs. However, RNNs suffer from the vanishing gradients problem and struggle to capture long-term dependencies. LSTM [10] and GRU [11], [57] evolved as advanced RNN variants to address the issues with the vanilla RNN model. The gating mechanism in these models helps regulate the flow of information along the sequence and retain the most important information. Compared to LSTM, which includes three gates (input, forget and output gates), GRU is more parameter efficient as it includes only two gates, namely the update and the reset gates.
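For concreteness, the standard GRU update equations are reproduced below, with update gate z_t and reset gate r_t; the notation follows common convention rather than any specific paper cited here.

```latex
\begin{aligned}
z_t &= \sigma\!\left(W_z x_t + U_z h_{t-1} + b_z\right) && \text{(update gate)} \\
r_t &= \sigma\!\left(W_r x_t + U_r h_{t-1} + b_r\right) && \text{(reset gate)} \\
\tilde{h}_t &= \tanh\!\left(W_h x_t + U_h \left(r_t \odot h_{t-1}\right) + b_h\right) && \text{(candidate state)} \\
h_t &= \left(1 - z_t\right) \odot h_{t-1} + z_t \odot \tilde{h}_t && \text{(new hidden state)}
\end{aligned}
```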

RNN and its variants like LSTM and GRU expect the input and output sequences to be of the same length. However, in the case of natural language generation tasks like machine translation, text summarization, etc., the input and output sequences can be of different lengths. So, researchers introduced the sequence-to-sequence (Seq2Seq) model to handle tasks with different input and output sequence lengths [12]. The Seq2Seq model was originally developed for machine translation and later explored for other NLP tasks. The Seq2Seq model consists of an encoder and a decoder based on RNN, LSTM or GRU to process the input sequence and generate the output sequence. The encoder processes the input sequence to generate a fixed-size context vector, based on which the decoder generates the output sequence. However, the fixed-size context vector fails to encode all the information in the input sequence, especially when the input sequence is long [13]. The attention mechanism was introduced to address this issue, allowing the decoder to focus on the relevant input tokens at each decoding step [13], [14]. However, as the encoder and decoder of the Seq2Seq model are based on RNN and its variants, the Seq2Seq model suffers from vanishing gradients and struggles to capture long-term dependencies.

2.1.2       Drawbacks of Traditional Deep Learning Models

The drawbacks of traditional deep learning models are as follows:

  • Lack of sequence and semantic understanding – MLPs ignore sequence information, treating all input tokens as independent. Moreover, MLPs can learn statistical patterns but struggle to capture semantic information in the input sequence.
  • Computationally expensive – CNNs require a large number of parameters to achieve good results. Although LSTM and GRU address the limitations of vanilla RNNs to some extent, these models include a gating mechanism which significantly increases the number of model parameters. The large number of parameters makes these models computationally expensive to train and use.
  • Vanishing gradients – RNNs suffer from the vanishing gradients problem. Although LSTM and GRU address this problem to some extent, these models also suffer from the vanishing gradient problem and have difficulties in capturing long-term dependencies.
  • Sequential computation – RNN and its variants process the input sequence token by token, i.e., sequentially. This sequential computation is a bottleneck that prevents these models from leveraging the parallel computing capability of advanced hardware like GPUs and TPUs. It also slows down training and inference, especially for long sequences.
2.1.3       Transformer Description

The transformer model evolved as an effective alternative to traditional deep learning models and addressed most of their issues [15]. In no time, the transformer model, with its novel and efficient architecture, gained a lot of popularity and became the de facto choice for building pretrained language models and large language models using the self-supervised learning paradigm [1], [44]. The key ingredient behind the massive success of the transformer model is its self-attention mechanism. The self-attention mechanism allows the transformer model to process the input sequence without using recurrent or convolution layers. This attention mechanism also allows the model to effectively capture long-range dependencies in the input sequence, making it highly effective for natural language understanding and generation tasks.

The transformer consists of encoder and decoder components. The encoder processes the input text using a stack of encoder layers and produces rich contextualized vector representations for each token in the input sequence, which are later used by the decoder. Each encoder layer consists of a self-attention mechanism and a feed-forward neural network. The self-attention mechanism adds contextual information to the token vectors by allowing each token to attend to all other input tokens, which helps the model capture long-term dependencies better. After the self-attention mechanism, the token vectors are passed through a feed-forward neural network, which introduces non-linearity and further transforms the representations. In this way, each encoder layer applies the self-attention mechanism and the feed-forward network to add more contextual information to the token vector representations.
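The following is a minimal NumPy sketch of single-head scaled dot-product self-attention as described above; the dimensions and random inputs are illustrative, and multi-head attention, masking and layer normalization are deliberately omitted.

```python
# Minimal sketch of scaled dot-product self-attention (single head) in NumPy.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """X: (seq_len, d_model) token vectors; W_*: learned projection matrices."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v        # queries, keys, values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # pairwise token affinities
    weights = softmax(scores, axis=-1)         # each token attends to all tokens
    return weights @ V                         # contextualized token representations

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                   # 5 tokens, d_model = 16
W_q, W_k, W_v = (rng.normal(size=(16, 16)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)  # (5, 16)
```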

The decoder receives the output from the last encoder layer and generates the output sequence by applying a stack of decoder layers, with each decoder layer having masked self-attention, encoder-decoder attention and a feed-forward neural network. The masked self-attention allows each token to attend only to the previously generated tokens and prevents the model from attending to future tokens. The encoder-decoder attention allows the decoder to attend to the encoded input sequence and helps the decoder focus on the relevant input sequence tokens while generating the output tokens.
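A small sketch of the causal (look-ahead) mask that implements masked self-attention is shown below; the mask values and sequence length are illustrative.

```python
# Minimal sketch: the causal mask used by masked self-attention in the decoder,
# so that position i can only attend to positions <= i.
import numpy as np

def causal_mask(seq_len):
    # Future positions (strict upper triangle) are set to -inf so that softmax
    # assigns them zero attention weight.
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

scores = np.zeros((4, 4))            # unmasked attention scores (illustrative)
masked = scores + causal_mask(4)
print(masked)                        # row i has -inf in columns j > i
```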

The self-attention mechanism in the transformer uses multiple attention heads, which allow the model to learn different aspects of the relationships between tokens and encode more contextual information in the token representations. The encoder and decoder also include an embedding layer, residual connections and layer normalization. The embedding layer transforms input tokens into vector representations, where each vector encodes both meaning and position information. Residual connections and layer normalization are applied after the self-attention mechanism and the feed-forward network. Residual connections avoid vanishing gradients and ensure a smooth flow of gradients, while layer normalization normalizes the token representations and stabilizes training. Apart from the embedding layer and the stack of decoder layers, the decoder also includes an output layer. The output layer is a softmax layer that assigns probabilities to each token in the vocabulary, indicating the likelihood of each token being the next word in the generated sequence.

2.2       Transfer Learning

2.2.1       Why Transfer Learning?

Although machine learning models achieved some success, these models require feature engineering, which is a laborious and expensive process involving human intervention in the form of domain experts [1]. Deep learning models, essentially a subset of machine learning, don’t require feature engineering as they learn features during training. Over the years, deep learning witnessed the evolution of various models like the multi-layer perceptron (MLP), convolutional neural networks (CNN), recurrent neural networks (RNN), long short-term memory networks (LSTM), gated recurrent unit networks (GRU), encoder-decoder networks, encoder-decoder networks with attention and, recently, transformers [55], [59]. Even though deep learning models eliminated the requirement of manual feature engineering and achieved significant progress, their main drawback is the requirement of a large amount of labelled data to achieve good results. Along with developing various deep learning models, the research community also focused on developing high-quality datasets for various tasks [60]. However, manual data annotation is a time-consuming, expensive and laborious process. Additionally, when there is a change in the data distribution, deep learning models must be re-trained with new labelled data to maintain good performance [61]. To reduce these costs, the research community focused on how to effectively train deep learning models with limited labelled data. Transfer learning evolved as one of the effective solutions to train deep learning models with limited labelled data [58], [61].

2.2.2       What is Transfer Learning?

Transfer learning in the context of artificial intelligence involves transferring existing knowledge from one task (or domain) to a different but related task (or domain) [58], [61]. Transfer learning avoids training a model from scratch and helps improve the model’s performance on the target task (or domain) by leveraging already existing knowledge. Transfer learning is largely based on the idea that when two tasks (or domains) are similar, the knowledge from the source task (or domain) with sufficient data can be used to enhance the performance of the target task (or domain) with limited data.

Fig. 2: Real-life examples of knowledge transfer (transfer learning). Examples are inspired by [58].

For example, consider the task of sentiment analysis of reviews of different products. It is highly expensive to annotate large datasets separately for each product. In such cases, transfer learning helps to adapt a model trained on one product’s reviews to perform well on another product’s reviews without requiring large amounts of labelled data [62].

Transfer learning draws inspiration from human beings, i.e., human beings can do new tasks with few or no examples just by reusing previously gained knowledge [60]. Figure 2 illustrates real-life examples of knowledge transfer (transfer learning). For example, a person who can ride a cycle can learn to ride a bike quickly with less effort. This is because riding a cycle and riding a bike have a lot in common, like maintaining balance. Similarly, a person familiar with the C programming language can learn the Python programming language easily. This is because both C and Python are programming languages and share many common concepts. So, due to the ability to reuse existing knowledge and train target models with limited data, transfer learning evolved as a promising learning paradigm and eventually played a crucial role in the evolution of advanced deep learning models like pretrained language models [1], [3] and the recent large language models. Overall, the advantages of transfer learning are:

  • Transfer learning helps to reduce the requirement of labelled data. (Data efficiency)
  • Transfer learning avoids training models from scratch by providing a good initialization from existing related models. (Faster training and development)
  • Transfer learning helps to enhance the performance on the target task (or domain) by reusing existing knowledge. (Enhance target task performance)
  • Transfer learning is explored across AI areas like computer vision, natural language processing, and speech processing. (Versatile)

In conclusion, transfer learning is a powerful learning paradigm in artificial intelligence that has benefits regarding data efficiency, speed, performance, adaptability, and real-world practicality.
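As a concrete illustration of this paradigm, the following is a minimal sketch of fine-tuning a pretrained encoder on a small labelled target set using the Hugging Face transformers library; the checkpoint name, toy data and single optimization step are placeholders for a real fine-tuning setup.

```python
# Minimal sketch of transfer learning in NLP: start from a pretrained encoder
# and fine-tune a classification head on limited target-task data.
# Assumes the `transformers` and `torch` packages; data below is a toy example.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "bert-base-uncased"                      # source of pretrained knowledge
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

texts = ["Great battery life.", "Stopped working after two days."]
labels = torch.tensor([1, 0])                         # small labelled target set
batch = tokenizer(texts, padding=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
outputs = model(**batch, labels=labels)               # one fine-tuning step
outputs.loss.backward()
optimizer.step()
print(float(outputs.loss))
```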

2.2.3       Transfer Learning vs Other Learning Paradigms

Along with transfer learning, the other learning paradigms that evolved to address large labelled data requirements are semi-supervised learning [63] and multi-task learning [64]. Semi-supervised learning is a learning paradigm in artificial intelligence that uses both labelled and unlabelled data to train models [63]. As semi-supervised learning uses labelled and unlabelled data, it lies between the unsupervised and supervised learning paradigms. As semi-supervised learning uses only a small amount of labelled data, it reduces the amount of labelled data required, like transfer learning. However, unlike transfer learning, where the distributions of the source and target tasks can be different, in semi-supervised learning the distributions of the labelled and unlabelled data should be the same [58]. Multi-task learning is a learning paradigm which focuses on enhancing the performance of a group of tasks by leveraging the interconnections between the tasks and learning them simultaneously [64]. Unlike multi-task learning, which learns all the tasks simultaneously, transfer learning first learns the source task and then transfers the knowledge to the target task. In multi-task learning, the focus is generally on all the tasks, while transfer learning focuses more on the target task [61].

2.3       Self-Supervised Learning

2.3.1       Why Self-Supervised Learning?

The main drawback with traditional deep learning models like CNN is the requirement of training from scratch. Training from scratch requires a large amount of labelled data. Data labelling is not only expensive but also a time-consuming and laborious process, which eventually makes model development expensive. To reduce the requirement of labelled data and make the model development process less expensive, the computer vision research community focused on developing models like VGGNet [17], AlexNet [16] and GoogleNet [18] on top of large CNNs, transfer learning and supervised learning. These models are pretrained on a large number of labelled images from the ImageNet dataset [65] using supervised learning and then adapted to downstream tasks.

Fig. 3: Illustration of self-supervised learning paradigm.

These pretrained models avoid training downstream models from scratch by providing a good initialization. Moreover, downstream models initialized from pretrained models converge faster and achieve good results even with limited labelled data [60].

Inspired by the huge success of pretrained image models, the NLP research community focused on developing pretrained language models [1], [3], [60]. However, the main challenge here is using supervised learning at scale to pretrain language models. This is because supervised learning at scale requires huge volumes of labelled data, which is almost impossible to obtain in many cases because of highly expensive annotation costs. Besides high annotation costs, supervised learning also suffers from generalization errors and spurious correlations [1], [22]. Self-supervised learning, with the ability to automatically generate labels and make use of unlabelled data, evolved as an effective alternative to supervised learning to pretrain language models at scale [1], [21], [22].

2.3.2       What is Self-Supervised Learning?

Self-supervised learning, a promising learning paradigm in artificial intelligence, helps models from different modalities like language, speech or image learn background knowledge from large volumes of unlabelled data [21], [22]. Unlike supervised learning, which relies on large volumes of labelled data, self-supervised learning pretrains models at scale based on the pseudo supervision offered by one or more pretraining tasks. Here, the pseudo supervision stems from labels which are automatically generated, without human intervention, based on the description of the pretraining task. In general, self-supervised learning involves one or more pretraining tasks [1], [3]. Moreover, the effectiveness of self-supervised learning is heavily influenced by the choice of pretraining task [1], [24], [26].

Figure 3 presents the self-supervised learning paradigm. In the pretraining phase, the labels are automatically generated based on the description of pretraining tasks, and the models learn universal knowledge using the pseudo supervision offered by one or more pretraining tasks. Pretraining helps the models to gain strong background knowledge, which allows the models to provide a good initialization to downstream models. The initialization from pretrained models enhances the downstream models in terms of generalization, performance, and robustness and makes them data efficient. After pretraining, pretrained language models can be easily adapted to downstream tasks with limited labelled data, and large language models can be used to solve downstream tasks using in-context learning without any task-specific fine-tuning.
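To illustrate how the pseudo labels come from the data itself, the following is a toy sketch of the masked language modeling pretraining task, where randomly selected tokens are masked and the original tokens serve as prediction targets; the whitespace tokenizer and masking rate are simplifications for illustration.

```python
# Toy sketch of self-supervised label generation via masked language modeling:
# the labels are the original tokens, produced automatically from raw text.
import random

def make_mlm_example(text, mask_prob=0.15, seed=0):
    random.seed(seed)
    tokens = text.split()                  # toy whitespace tokenizer
    inputs, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            inputs.append("[MASK]")
            targets.append(tok)            # label comes from the raw text itself
        else:
            inputs.append(tok)
            targets.append("-")            # ignored position (no loss)
    return inputs, targets

inputs, targets = make_mlm_example(
    "self supervised learning creates labels from unlabelled text")
print(inputs)
print(targets)
```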

2.3.3       Evolution of Self-Supervised Learning

Figure 4 shows the evolution of self-supervised learning in natural language processing from embedding models to the recent large language models. The evolution of self-supervised learning in natural language processing happened in three stages, namely embedding models, pretrained language models and large language models. Initially, self-supervised learning was explored to develop non-contextual embedding models (e.g., Word2Vec [5], FastText [7]), followed by sentence embedding models (e.g., Sent2Vec [66]) and contextual embedding models (e.g., ELMo [67]). The quest to develop pretrained models motivated NLP researchers to explore self-supervised learning to develop pretrained language models [1], [3], [60]. As pretrained language models cannot generalize to NLP tasks without fine-tuning, the NLP research community focused on developing large language models using self-supervised learning at a large scale [4], [40]–[42], [68]. To summarize, self-supervised learning is undergoing a rapid evolution and is also treated as a significant element in achieving near human-level intelligence [22].

2.3.4       Self-Supervised Learning vs Other Learning Paradigms

Self-supervised learning, with its exceptional ability to make use of unlabelled data at scale, evolved as an alternative to supervised learning for pretraining models. However, self-supervised learning has similarities and dissimilarities with supervised learning [1]. Both self-supervised and supervised learning provide supervision.

Fig. 4: Evolution of self-supervised learning in natural language processing.

However, unlike supervised learning, which offers supervision based on human-labelled data, self-supervised learning offers supervision based on automatically generated labels. Supervised learning is mostly used to train downstream models with task-specific data, while self-supervised learning is used to train pretrained models to offer good initialization to downstream models. Similarly, self-supervised learning has similarities and dissimilarities with unsupervised learning [1]. Both self-supervised learning and unsupervised learning make use of unlabelled data without requiring any labelled data. However, unlike self-supervised learning, which focuses on learning rich data representations using pseudo supervision, the main focus of unsupervised learning is to identify hidden patterns in the data without any supervision.

2.4       Pretrained Language Models

2.4.1       Overview

Deep learning witnessed the evolution of several models, from convolutional neural networks to the latest transformers [55], [59]. The transformer addressed the drawbacks of traditional deep learning models like the convolutional neural network, the recurrent neural network and its variants, and achieved significant progress [15], [69]. However, the transformer and traditional deep learning models suffer from one major drawback: training from scratch, which requires large volumes of labelled data and makes model development expensive. Inspired by the success of pretrained image models like VGGNet [17], AlexNet [16] and GoogleNet [18] in computer vision, NLP researchers focused on developing pretrained models for natural language processing based on transformers and self-supervised learning [1], [3], [60], [70]. Pretrained language models are advanced deep learning models, essentially transformer-based, pretrained on large volumes of text data, and they can be adapted to downstream tasks with limited labelled data. Along with the transformer model, self-supervised learning and transfer learning are the key concepts which make pretrained language models possible [1] (refer to Figure 5). The era of pretrained language models started with the GPT-1 [20] and BERT [19] models. The massive success of BERT and GPT-1 triggered the development of other pretrained language models like RoBERTa [71], XLNet [23], ELECTRA [24], ALBERT [25], DeBERTa [26], [27], GPT-2 [28], T5 [29], BART [30], PEGASUS [72], etc.

2.4.2       Evolution of Pretrained Language Models

The evolution of pretrained language models happened along three dimensions: encoder-based models, decoder-based models and encoder-decoder based models [1]. Encoder-based models consist of an embedding layer and a stack of encoder layers, with each encoder layer having self-attention and feed-forward networks. Encoder-based models are primarily used for natural language understanding tasks like text classification, entity extraction, relation extraction, etc. Some of the popular encoder-based pretrained language models are BERT, RoBERTa, XLNet, ALBERT, ELECTRA, DeBERTa, etc.

Decoder-based models consist of an embedding layer and a stack of decoder layers, with each decoder layer having masked self-attention and feed-forward networks. Decoder-based models are used for both natural language understanding and generation tasks. Some of the popular decoder-based pretrained language models are GPT-1, GPT-2, etc. Encoder-decoder based models consist of both encoder and decoder modules. In general, encoder-decoder based models are used for natural language generation tasks like machine translation, text summarization, etc., while some are explored for both natural language understanding and generation tasks. Some of the popular encoder-decoder based models are T5, BART, PEGASUS, M2M100, NLLB, etc.
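The following brief sketch contrasts the three families using the Hugging Face transformers pipeline API; the checkpoints shown (a DistilBERT sentiment model, GPT-2 and T5-small) are common public ones chosen purely for illustration.

```python
# Brief sketch contrasting the three pretrained-model families via the
# Hugging Face `transformers` pipeline API (checkpoints are common public ones).
from transformers import pipeline

# Encoder-based model (BERT-style) for understanding tasks such as classification.
classifier = pipeline("sentiment-analysis",
                      model="distilbert-base-uncased-finetuned-sst-2-english")
print(classifier("The plot was gripping from start to finish."))

# Decoder-based model (GPT-style) for open-ended text generation.
generator = pipeline("text-generation", model="gpt2")
print(generator("Pretrained language models are", max_new_tokens=20))

# Encoder-decoder model (T5-style) for sequence-to-sequence tasks like summarization.
summarizer = pipeline("summarization", model="t5-small")
print(summarizer("Transfer learning reuses knowledge from a source task to "
                 "improve a related target task with limited labelled data."))
```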

After the massive success of pretrained language models in the English language, the research community started to develop multilingual pretrained language models [73] and pretrained language models for non-English languages [1]. Some of the popular multilingual pretrained language models are mBERT [19], mT5 [74], mBART [75], IndicBERT [76], XLM [77], XLM-R [78], mDeBERTa [26], etc.

Fig. 5: Key ingredients in the evolution and success of pretrained language models.

As the performance of general domain pretrained language models is limited in domain-specific tasks [1], [3], the research community focused on developing pretrained language models for specific domains like social media [79], [80], finance [81]–[83], legal [84], [85], coding [86]–[88], healthcare [89]–[91], etc. As pretrained language models have millions of parameters, which makes model fine-tuning and deployment expensive, compact pretrained language models like DistilBERT [92], TinyBERT [93], MobileBERT [94], MiniLM [95], etc., are developed. As pretrained language models have a limited context length, which limits their performance on long sequences, long-sequence pretrained language models like Longformer [96], BigBird [97], etc., are developed. Pretrained language models encode only the universal language knowledge available in the pretraining corpus and lack the valuable knowledge available in ontologies. So, the research community developed ontology-enriched models like SapBERT [98], UmlsBERT [99], etc.

2.5       Large Language Models

2.5.1       Overview

Pretrained language models, starting from the GPT-1 [20] and BERT [19] models to the latest DeBERTa [26], [27], achieved significant progress and also reduced the amount of labelled data required to train task-specific models [1], [3]. Pretrained language models follow the paradigm “pretrain then fine-tune”, i.e., the model is pretrained first and then adapted to downstream tasks by fine-tuning. As task-specific fine-tuning is mandatory to adapt a pretrained language model to downstream tasks, pretrained language models cannot generalize to unseen downstream tasks without task-specific fine-tuning. Moreover, task-specific fine-tuning requires labelled data and creates a separate copy of the pretrained language model for each downstream NLP task, increasing model development and deployment costs [1].

Pretrained language models are treated as narrow AI systems as they are adapted through fine-tuning and then used for specific downstream tasks. However, the main focus of the research community is to develop artificial general intelligence systems [43], [100] which are not narrowly focused on specific tasks but have the ability for general problem-solving and can handle even unseen tasks by utilizing existing knowledge, like human beings. NLP researchers observed that the performance of pretrained language models can be enhanced further through scaling along three dimensions: pretraining computation, pretraining data and model size [28], [29], [71]. A larger size allows the models to capture more nuanced language patterns, which in turn enhances their ability to understand and generate text, while larger pretraining data helps the model learn from a wider range of text. The promising results from scaling and the quest to build artificial general intelligence systems motivated NLP researchers to build ever bigger models, which eventually resulted in the evolution of GPT-3 and its successor models [4], [31]–[33]. Learning paradigms like transfer learning and self-supervised learning make large language models possible, but scaling makes these models powerful.

The research community coined a new phrase, “large language models”, to refer to GPT-3 and its successor large models to differentiate them from small pretrained language models [44]. Large language models (LLMs) are a special class of pretrained language models obtained by scaling model size, pretraining corpus and computation, as shown in Figure 6. LLMs are essentially deep learning models, specifically transformer-based, pretrained on large volumes of text data and aligned to human preferences using meta-training. Pretraining provides universal language knowledge to the model [1], while meta-training aligns the model to act based on the user’s intentions. Here, the user’s intention includes explicit intentions, like following instructions, and implicit intentions, like maintaining truthfulness and avoiding bias, toxicity, or harmful behaviour [2].

Because of their large size and pretraining on large volumes of text data, LLMs exhibit special abilities, referred to as emergent abilities [101], [102], which allow them to achieve remarkable performance without any task-specific training in many natural language processing tasks. For downstream task usage, pretrained language models leverage the supervised learning paradigm, which involves task-specific fine-tuning and hundreds or thousands of labelled instances [1], [3]. LLMs leverage in-context learning (ICL), a new learning paradigm that doesn’t require task-specific fine-tuning and many labelled instances [4], [45].

Fig. 6: Key ingredients in the evolution and success of large language models.

LLMs treat any NLP task as a conditional text generation problem and generate the desired text output by conditioning on the input prompt, which includes a task description, the test input and, optionally, a few examples.

2.5.2       Evolution of Large Language Models

The evolution of large language models happened along two dimensions: closed-source LLMs and open-source LLMs. The era of LLMs roughly started with GPT-3. Following the success of GPT-3, OpenAI developed successor models like InstructGPT [2], Codex [103], ChatGPT and GPT-4 [42]. Google introduced models like GLaM [33], PaLM [31], PaLM2 [68], LaMDA [34] and Bard.

DeepMind developed models like Gopher [35], Chinchilla [32], AlphaCode [104] and Sparrow [105]. Companies like Baidu, AI21 Labs and Amazon developed the models Ernie 3.0 Titan [106], Jurassic-1 [107] and AlexaTM [108], respectively. Although the performances of closed-source LLMs are impressive, the main drawback with these models is that they are behind paywalls, i.e., their weights are not publicly available, only some of them are accessible through the APIs offered by the respective companies, and model usage is charged based on the tokens processed and generated.

To address this issue, the research community focused on developing open-source LLMs with publicly available weights. Some of the popular open-source LLMs are OPT [39], OPT-IML [109], Galactica [38], LLaMA [40], LLaMA2 [41] and Falcon. The performances of these open-source LLMs are on par with closed-source LLMs. Moreover, in some cases, open-source LLMs outperform closed-source LLMs. For example, Galactica beats closed-source LLMs like GPT-3, Chinchilla and PaLM. Inspired by the success of open-source LLMs in the English language, the research community focused on developing multilingual and bilingual LLMs. BLOOM [37] and BLOOMZ [110] are examples of multilingual LLMs, while JAIS [111] (English and Arabic), GLM [112] (English and Chinese) and FLM-101B [113] (English and Chinese) are examples of bilingual LLMs.

The success of closed and open-source LLMs in the general domain triggered the development of domain-specific LLMs like FinGPT [114] and BloombergGPT [115] in the finance domain, MedPaLM [116] and MedPaLM2 [117] in the healthcare domain, and StarCoder [118], Code Llama [119], CodeGen [120] and CodeGen2 [121] in the coding domain. For example, Bloomberg developed BloombergGPT, an exclusive LLM for the finance domain. Similarly, Google developed the MedPaLM and MedPaLM2 LLMs exclusively for the healthcare domain based on the PaLM and PaLM2 models, respectively. Similarly, HuggingFace developed StarCoder, Meta AI developed Code Llama, and Salesforce developed the CodeGen and CodeGen2 LLMs exclusively for coding tasks.

3         GPT-3 FAMILY LARGE LANGUAGE MODELS

3.1       Overview

OpenAI, an AI company established in 2015, focused on building generative models. OpenAI researchers initially explored RNNs for developing generative language models [122]. Inspired by the huge success of the transformer model and its ability to capture long-term dependencies, OpenAI researchers leveraged the transformer decoder to build GPT-1 (117M parameters), the first-ever transformer-based pretrained language model [20]. GPT-1 introduced a new paradigm, “pretrain and fine-tune”, to develop downstream task models effectively. Originally, the “pretrain and fine-tune” paradigm was introduced by Dai et al. [123] and then explored by Howard and Ruder [124] to build language models for text classification.

However, unlike the work of Radford et al. [20], these research works build language models based on LSTM, which lacks parallelization ability and has difficulties in capturing long-term dependencies. Radford et al. [20] used causal language modeling as the pretraining task to pretrain the GPT-1 model. The causal language modeling pretraining task involves generating the next token based on the previous tokens. GPT-1 achieved SOTA results in 9 out of 12 NLP tasks [20].
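A minimal PyTorch sketch of the causal language modeling objective is given below; a single embedding and linear layer stand in for the transformer stack, and the vocabulary size and token ids are toy values chosen for illustration.

```python
# Minimal sketch of the causal language modeling objective used to pretrain
# GPT-style models: predict token t from tokens < t.
import torch
import torch.nn.functional as F

vocab_size, d_model, seq_len = 100, 32, 6
token_ids = torch.randint(0, vocab_size, (1, seq_len))   # toy input sequence

embed = torch.nn.Embedding(vocab_size, d_model)
lm_head = torch.nn.Linear(d_model, vocab_size)            # stand-in for the transformer stack

hidden = embed(token_ids)                                  # (1, seq_len, d_model)
logits = lm_head(hidden)                                   # next-token scores at each position

# Shift so position i predicts token i+1; the last position has no target.
loss = F.cross_entropy(
    logits[:, :-1, :].reshape(-1, vocab_size),
    token_ids[:, 1:].reshape(-1),
)
print(float(loss))
```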

Fig. 7: Open AI journey starting from GPT-1 to the latest GPT-4.
Fig. 8: GPT-3 family large language models (GLLMs), from the GPT-3 series to the latest GPT-4. SFT stands for supervised fine-tuning, and RLHF stands for reinforcement learning from human feedback. “Raw” indicates that the model is just pretrained and not aligned using SFT or RLHF, while “RLHF-Chat” indicates that the model is aligned using RLHF and optimized for chat.

Inspired by the success of GPT-1, OpenAI researchers introduced the GPT-2 model to push the results further [28]. The GPT-2 model is pretrained on the WebText corpus (around 40 GB of text), which is much larger than the Books corpus used to pretrain the GPT-1 model. The authors developed four versions of the GPT-2 model with varying parameters: 117M, 345M, 762M and 1.5B. The authors observed that the perplexity decreases with an increase in the model’s size, and even for the largest version with 1.5B parameters, the decrease in perplexity did not exhibit saturation. This revealed that GPT-2 underfitted the pretraining dataset, and extending the training duration could have further reduced perplexity. This observation triggered the insight that “developing even larger language models will decrease the perplexity further and enhance natural language understanding and generation capabilities”. The insights gained from the GPT-1 and GPT-2 models laid a strong foundation for the evolution of the GPT-3 family large language models, including the latest models like ChatGPT and GPT-4. Figure 7 shows the journey of OpenAI from GPT-1 to the latest GPT-4, and Figure 8 shows the GPT-3 family large language models from the GPT-3 series to the latest GPT-4.

3.2       GPT-3 Models

The experimental results of GPT-2 showed that increasing the model size further reduces the perplexity, and a model with more parameters achieves better results than models with fewer parameters. This observation motivated OpenAI researchers to train much bigger GPT models, which eventually resulted in the introduction of the GPT-3 model [4]. The GPT-3 model contains 175B parameters and is over 100 times bigger than its predecessor, GPT-2. Moreover, the GPT-3 model is trained over a corpus with text from multiple sources like webpages, Wikipedia and books, unlike the GPT-1 and GPT-2 models, which are pretrained over corpora with text from books and webpages, respectively. Scaling in three dimensions (pretraining data, model size and pretraining computation) allows the GPT-3 model to learn more from large volumes of text from different sources, which eventually empowers the model to handle unseen tasks without any task-specific training. Unlike the GPT-1 and GPT-2 models, which leverage supervised learning for downstream tasks, GPT-3 leverages training-free in-context learning. In-context learning is a new learning paradigm that is training-free and solves downstream tasks using the knowledge encoded in the model parameters [45]. In-context learning accepts a prompt as input, where the input prompt consists of a task description, optionally a few examples, and other instructions.

3.3       GPT-3.5 Models

Two main drawbacks of the GPT-3 model are that (i) GPT-3 is not trained on code data, and hence it lacks complex reasoning abilities like solving math problems [44], and (ii) the GPT-3 model struggles to follow user instructions and sometimes generates harmful text [2]. These two drawbacks are addressed by the GPT-3.5 models. Brown et al. [4] observed that GPT-3 can generate simple programs, although it is not specifically trained for generating code. Triggered by this observation, OpenAI researchers introduced Codex [103], an exclusive GLLM for coding tasks. Codex is developed by fine-tuning a GPT model with 12B parameters on publicly available GitHub code. Moreover, it is observed that GPT models explicitly trained on code data exhibit better reasoning capabilities.

During pretraining, the GPT-3 model is optimized based on the causal language modeling objective, which involves predicting the next word based on the previous words. In-context learning during inference can be viewed as conditional text generation, where the model generates the output by conditioning on the given prompt. The model performs text generation in both phases, but the nature of the generation differs: during pretraining, the model conditions on the previous words and generates the next word, i.e., vanilla text generation, whereas during in-context learning, the model conditions on the prompt and generates the answer rather than simply continuing the text, i.e., conditional text generation. So, there is a gap between pretraining and in-context learning at inference. Due to this gap, in many cases during inference, the GPT-3 model fails to understand the given prompt and tends to simply generate the next words.

The pretraining corpus of the GPT-3 model includes some amount of text with undesired qualities like misinformation, abuse, hate, sexism, etc., due to which the model sometimes generates harmful text. To enhance complex reasoning and instruction-following abilities and to reduce harmful text generation, GPT-3.5 models are developed by fine-tuning GPT-3 models over code data and then aligning them using supervised fine-tuning (SFT) or reinforcement learning from human feedback (RLHF) [2]. For example, code-davinci-002 is obtained by further training a GPT-3 model over code data, and text-davinci-002 is then obtained by aligning code-davinci-002 using SFT.
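To make the SFT step concrete, the following is a conceptual sketch of how an instruction-response pair is typically prepared so that the loss is computed only on response tokens; the token ids are toy values, and the masking convention shown (ignore index -100) is a common PyTorch/Hugging Face convention, not a detail reported for the OpenAI models.

```python
# Conceptual sketch of supervised fine-tuning (SFT) data preparation:
# the prompt and desired response are concatenated, and the loss labels for
# the prompt tokens are masked out so only response tokens are supervised.
IGNORE_INDEX = -100   # convention used by cross-entropy losses to skip positions

def build_sft_example(prompt_ids, response_ids):
    input_ids = prompt_ids + response_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return input_ids, labels

prompt_ids = [12, 7, 45, 3]        # e.g. "Summarize the following text:"
response_ids = [88, 19, 54]        # e.g. the human-written summary
input_ids, labels = build_sft_example(prompt_ids, response_ids)
print(input_ids)
print(labels)   # prompt positions ignored, response positions supervised
```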

3.4       ChatGPT and GPT-4

GPT-3 models are capable of understanding and generating natural language, while GPT-3.5 models are capable of understanding and generating both natural language and code. However, both GPT-3 and GPT-3.5 models are not chat optimized. This drawback is addressed by the ChatGPT (GPT-3.5-turbo) and GPT-4 [42] models. OpenAI introduced ChatGPT in November 2022. With its extraordinary conversational abilities, ChatGPT garnered millions of users within a few weeks of its launch. Following ChatGPT, OpenAI released the GPT-4 model in March 2023, which can handle both text and image inputs. Apart from generating text with human-like fluency, these models further pushed the results in many natural language processing tasks. The performance of these models in downstream tasks and specific domains is discussed in detail in Sections 4 and 5.

4         PERFORMANCE OF GLLMS IN DOWNSTREAM TASKS

4.1       Text Classification

Overview. Text Classification is one of the fundamental tasks in natural language processing [145]. It involves assigning label(s) from a predefined set of labels to a given piece of text. Here, the piece of text can be a phrase, sentence, paragraph or even a document. Many of the natural language processing problems, like offensive language identification, stance detection, sentiment analysis, hate speech detection, etc., are approached as text classification. Text Classification can be binary, multi-class or multi-label.

In the case of text classification, the large language model is prompted with a task description, a predefined set of labels, examples (optional) and the test input. Here, the task description, the predefined set of labels and the examples constitute the context. The model infers the task from the context and then assigns the most appropriate label(s) to the given test input. Additional inputs, like examples in the context, enrich the prompt with more information, which allows the model to understand the task better and thus perform better.
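For illustration, a zero-shot classification prompt of this form could be assembled as in the sketch below; the task wording and label set are placeholders rather than prompts used in the surveyed papers.

```python
# Illustrative sketch of a zero-shot text classification prompt: the task
# description and the predefined label set form the context, followed by the
# test input. Wording and labels are placeholders.
labels = ["positive", "negative", "neutral"]

def build_classification_prompt(test_input, labels):
    return (
        "Classify the sentiment of the given review.\n"
        f"Choose exactly one label from: {', '.join(labels)}.\n\n"
        f"Review: {test_input}\n"
        "Label:"
    )

print(build_classification_prompt("Delivery was late but the product works fine.", labels))
```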

| Paper | Task(s) | GLLMs Explored | Prompt Settings | Domain(s) | Language(s) | SOTA Results |
|---|---|---|---|---|---|---|
| [125] | Stance Detection | ChatGPT | ZS, FS | Social Media | English | No |
| [126] | Stress Detection, Depression Detection, Suicidal Detection | ChatGPT | ZS | Social Media | English | No |
| [127] | Mental Health Analysis Tasks | ChatGPT | ZS | Social Media | English | No |
| [128] | Sentiment Analysis | ChatGPT | ZS, FS | Social Media | English, Chinese | No |
| [129] | Stock Prediction based on Sentiment Analysis | ChatGPT | ZS | Finance | English | No |
| [130] | Computational Social Science Tasks | GPT-3, ChatGPT | ZS | Social Media | English | No |
| [131] | Genre Identification | ChatGPT | ZS | General | English, Slovenian | No |
| [132] | Sentiment Analysis, Misinformation Detection | ChatGPT | ZS | Social Media | English, Indonesian, Javanese, Buginese | No |
| [133] | Nine NLU tasks including Sentiment Analysis and Natural Language Inference | ChatGPT | ZS | General, Social Media | English | No |
| [134] | Paraphrase Detection, Sentiment Analysis, Natural Language Inference | ChatGPT | ZS, FS | General | English | No |
| [135] | Sentiment Analysis, Natural Language Inference | GPT-3, GPT-3.5, ChatGPT | ZS, FS | General, Social Media | English | No |
| [136] | Financial News Classification, Sentiment Analysis | ChatGPT, GPT-4 | ZS | Finance | English | No |
| [137] | Natural Language Inference | ChatGPT, GPT-4 | ZS, FS | Healthcare | English | No |
| [138] | Natural Language Inference, Document Classification | GPT-3.5, GPT-4, Bard | ZS, FS | Healthcare | English | No |
| [139] | Hate Speech Detection | GPT-3 | ZS, FS | Social Media | English | No |
| [140] | Implicit Hate Speech Detection | ChatGPT | ZS | Social Media | English | No |
| [141] | Clinical Text Classification | GPT-3, ChatGPT, GPT-4 | ZS, FS | Healthcare | English | No |
| [142] | Sentiment Analysis, Suicide Tendency Detection, Personality Prediction | ChatGPT | ZS | Social Media | English | No |
| [143] | Intent Classification | GPT-3 | ZS | Social Media | English | No |
| [144] | News Classification, Sentiment Analysis | InstructGPT | ZS, FS | General, Social Media | English | Yes |

TABLE 1. Summary of research works exploring GLLMs for various text classification problems. Here ZS represents zero-shot, and FS represents few-shot.

Research works exploring GLLMs for text classification. Recent works explored GLLMs like GPT-3, GPT-3.5, ChatGPT and GPT-4 for various text classification problems like sentiment analysis [128], [129], [132], [134], [136], [142], [144], stance detection [125], intent classification [143], mental health analysis [126], [127], hate speech detection [139], [140], misinformation detection [132], paraphrase detection [134], news classification [136], natural language inference [134], [137], [138], etc. The evaluation is done in zero and few-shot settings using different prompting strategies like chain-of-thought (CoT) [125], [127], [134], [137], [138], [141], [144], self-question prompting (SQP) [138], clue and reasoning prompting (CARP) [144], etc. Most of the research works focused on English datasets, except a few which focused on other languages like Chinese [128], Slovenian [131], Indonesian [132], Javanese [132] and Buginese [132]. A brief summary of research works exploring GLLMs for various text classification problems is presented in Table 1.

Most of the research works showed that, compared to direct prompting, advanced prompting strategies help the model achieve better results. This is because advanced prompting involves generating intermediate outputs, which in turn guide the model in generating the correct final output. Zhang et al. [125] explored the ChatGPT model with direct and chain-of-thought prompting for stance detection in tweets in zero and few-shot settings. Experiment results on three datasets showed that one-shot chain-of-thought prompting outperforms zero-shot direct prompting and also achieves near state-of-the-art results. Yang et al. [127] designed emotion-enhanced CoT prompting to combine emotion information with the power of CoT prompting for mental health analysis tasks. Experiments on five different mental health analysis tasks showed that ChatGPT with emotion-enhanced CoT outperforms other prompting strategies. Overall, ChatGPT outperforms traditional deep learning models like CNN and RNN but still lags behind task-specific fine-tuned models. Wu et al. [137] explored models like GPT-4 and ChatGPT for the radiology natural language inference task. The authors reported that GPT-4 with the IRSA (Instruction Response Semantic Alignment) prompting strategy outperforms ChatGPT in both zero and few-shot settings. IRSA prompting is almost the same as direct prompting, except that the model is instructed to output the labels “contain” and “not contain” instead of “entailment” and “not entailment”, just to reduce the complexity. Wang et al. [138] evaluated the performances of the latest LLMs like GPT-3.5, GPT-4 and Bard on text classification tasks like natural language inference and document classification in the healthcare domain. The GPT-4 model with the newly designed self-question prompting (SQP) outperforms the other models in both zero and few-shot settings. The SQP strategy involves identifying the key elements of the input, generating questions and answers related to the key elements, and then using them to generate the final output. Parikh et al. [143] showed that the performance of the GPT-3 model for intent classification in zero-shot settings can be enhanced by including intent class descriptions in the prompt.

Some of the research works demonstrated that GPT-3 family large language models can outperform task-specific fine-tuned models [131], [134] and domain-specific LLMs [136]. Kuzman et al. [131] showed that ChatGPT outperforms the fine-tuned XLM-R model in the task of automatic genre identification in the English language. Zhong et al. [134] compared the performances of ChatGPT and fine-tuned models based on the base and large versions of BERT and RoBERTa on tasks like natural language inference, sentiment analysis and paraphrase identification. The results showed that ChatGPT outperforms both base and large fine-tuned models by a large margin in the case of the natural language inference task. Li et al. [136] evaluated the performances of general LLMs like ChatGPT and GPT-4 and domain-specific LLMs like BloombergGPT on tasks like finance news classification and sentiment analysis. In the case of finance news classification, GPT-4 outperforms all other LLMs, including the domain-specific BloombergGPT model.

In all the above discussed research works, the performance of GLLMs is impressive but still lags behind SOTA results. Sun et al. [144] showed that it is possible to achieve SOTA results in text classification tasks with the newly designed clue and reasoning prompting (CARP) strategy. CARP involves a progressive reasoning approach for handling complex linguistic phenomena and consists of three steps: finding clues based on the input, generating reasoning steps based on the input and the generated clues, and then arriving at the final output based on the input, generated clues and reasoning steps. The results are impressive: InstructGPT with the CARP prompting strategy achieves SOTA results on four text classification datasets using just 16 examples.
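Below is a minimal sketch of how a CARP-style few-shot prompt could be assembled. The demonstration clues and reasoning shown here are hand-written placeholders (the cited work derives them differently), so treat this only as an illustration of the three-step clues-reasoning-decision structure.

# A minimal sketch of a CARP-style few-shot prompt for sentiment classification.
# The demonstration content below is a placeholder, not data from the cited work.

DEMONSTRATIONS = [
    {
        "text": "The battery died after two days and support never replied.",
        "clues": "'died after two days', 'never replied'",
        "reasoning": "Both clues describe product failure and poor service, "
                     "which signal a negative experience.",
        "label": "negative",
    },
]

def carp_prompt(test_text: str) -> str:
    parts = ["Classify the sentiment of the text as positive or negative.\n"]
    for demo in DEMONSTRATIONS:
        parts.append(
            f"Text: {demo['text']}\n"
            f"Step 1 - Clues: {demo['clues']}\n"
            f"Step 2 - Reasoning: {demo['reasoning']}\n"
            f"Step 3 - Label: {demo['label']}\n"
        )
    # The model is asked to continue from Step 1 for the test input.
    parts.append(f"Text: {test_text}\nStep 1 - Clues:")
    return "\n".join(parts)

if __name__ == "__main__":
    print(carp_prompt("The screen is gorgeous and setup took two minutes."))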

4.2       Information Extraction

Overview. Information Extraction (IE) in natural language processing involves extracting structured data like entities, relationships and events from unstructured text data [164]. Transforming unstructured text data into structured data enables efficient data processing, knowledge discovery and decision making, and enhances information retrieval and search. Information extraction involves a number of tasks like entity typing, entity extraction, relation classification, relation extraction, event detection, event argument extraction and event extraction [153]. Entity typing (ET) involves classifying identified named entity mentions into one of the predefined entity types [165]. Named Entity Recognition (NER) or Entity Extraction (EE) involves identifying entity mentions and then assigning them to appropriate entity types [166]. Relation classification (RC) involves identifying the semantic relationship between two target entities in a sentence [167]. Relation Extraction (RE) involves extracting the entities and then classifying the semantic relationship between the two target entities, i.e., it involves entity extraction followed by relation classification [168]. Event Detection (ED) aims to identify and categorize words or phrases that trigger events [169]. Event Argument Extraction (EAE) involves identifying event arguments, i.e., the entities involved in the event, and then classifying their roles [170]. Event Extraction (EE) aims to extract both the events and the involved entities, i.e., it involves event detection followed by event argument extraction [171].
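The sketch below makes the distinctions between these task outputs concrete using simple Python data structures. The entity types and event roles used are illustrative placeholders, not a standard IE schema.

# A small sketch of the structured outputs the different IE tasks produce.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Entity:                    # output of entity extraction / entity typing
    mention: str
    entity_type: str             # e.g. "Person", "Organization"

@dataclass
class Relation:                  # output of relation classification / extraction
    head: Entity
    tail: Entity
    relation_type: str           # e.g. "works_for"

@dataclass
class Event:                     # output of event detection + argument extraction
    trigger: str                 # word or phrase that triggers the event
    event_type: str              # e.g. "Employment.Start"
    arguments: List[Tuple[Entity, str]]   # (participating entity, role)

# Example for the sentence "Alice joined Acme Corp in 2021."
alice = Entity("Alice", "Person")
acme = Entity("Acme Corp", "Organization")
print(Relation(alice, acme, "works_for"))
print(Event("joined", "Employment.Start", [(alice, "employee"), (acme, "employer")]))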

Research works exploring GLLMs for information extraction tasks. The recent works explored GPT-3 family large language models for various information extraction tasks like entity typing [153], entity extraction [136], [138], [146]–[149], [153], [158]–[160], [162], relation classification [138], [149], [153]–[156], [163], relation extraction [148], [151]–[153], [158], [161], [162], event detection [153], event argument extraction [153] and event extraction [148], [150], [153], [158]. The evaluation is done in zero and few-shot settings using different prompting strategies like chain-of-thought (CoT) [138], [152], [156], [161], self-verification [159], self-question prompting (SQP) [138], event ranking (ER) [152], etc. Most of the research works focused on English datasets, except for a few research works focusing on other languages like Chinese [148]. A brief summary of research works exploring GLLMs for various information extraction tasks is presented in Table 2.

Hu et al. [147] demonstrated that the performance of ChatGPT in extracting clinical entities like problem, treatment and test can be enhanced by including additional information about entity types, like synonyms and subtypes, in the prompt. Wei et al. [148] proposed ChatIE, a two-stage framework for information extraction, with each stage implemented as multi-turn question answering. This two-stage framework helps the model break complex IE tasks into sub-tasks, which allows the model to perform better.

Paper | Task(s) | GLLMs Explored | Prompt Settings | Domain(s) | Language(s) | SOTA Results
[146] | Entity Extraction | ChatGPT | ZS | General | English | No
[147] | Entity Extraction | GPT-3, ChatGPT | ZS | Healthcare | English | No
[148] | Entity Extraction, Event Extraction, Relation Extraction | ChatGPT | ZS | General | English, Chinese | No
[149] | Entity Extraction, Relation Classification | GPT-3 | FS | Healthcare | English | No
[150] | Event Extraction | ChatGPT | FS | General | English | No
[151] | Protein-Protein Interaction Extraction | GPT-3, ChatGPT and GPT-4 | ZS | Healthcare | English | No
[152] | Temporal Relation Extraction | ChatGPT | ZS | General | English | No
[153] | Entity Typing, Entity Extraction, Relation Classification, Relation Extraction, Event Detection, Event Argument Extraction, Event Extraction | ChatGPT | ZS | General | English | No
[154] | Temporal Relation Classification, Causal Relation Classification, Discourse Relation Classification | ChatGPT | ZS, FS | General | English | No
[155] | Relation Classification | GPT-3.5 | FS | General, Scientific Literature | English | Yes
[156] | Relation Classification | GPT-3.5 | FS | General, Scientific Literature | English | Yes
[157] | Entity Extraction | GPT-3.5, ChatGPT | ZS | General | English | No
[135] | Entity Extraction, Relation Extraction | GPT-3, GPT-3.5, ChatGPT | ZS, FS | General, Social Media | English | No
[158] | Entity Extraction, Relation Extraction and Event Detection | InstructGPT | FS | General | English | Yes
[159] | Entity Extraction | GPT-3 | FS | General | English | No
[138] | Entity Extraction, Relation Classification | GPT-3.5, GPT-4 | ZS, FS | Healthcare | English | No
[160] | Entity Extraction | GPT-3 | ZS | General | English | No
[161] | Relation Extraction | GPT-3 | FS | General, Healthcare | English | No
[136] | Entity Extraction | ChatGPT, GPT-4 | FS | Finance | English | No
[162] | Entity Extraction, Relation Extraction | GPT-3, Codex | FS | General, Scientific Literature | English | No
[163] | Relation Classification | GPT-3.5, ChatGPT | ZS | General | English | No

TABLE 2. Summary of research works exploring GLLMs for information extraction tasks. Here ZS represents zero-shot, and FS represents few-shot.

Results showed that ChatGPT used with the ChatIE framework outperforms vanilla ChatGPT by a large margin of more than 18 points. Gutierrez et al. [149] enhanced the performance of the GPT-3 model for entity extraction and relation classification by using techniques like contextual calibration [172] to reduce bias and kNN-based demonstration selection. Gao et al. [150] examined the performance of ChatGPT for event extraction in few-shot settings. The model is prompted with task descriptions, definitions of event types, positive and negative examples, and the test input. The authors reported that including negative examples decreases the performance of the model, which is in line with other existing works [173]. The possible reason for this is that the model misunderstands negative examples as positive examples. Rehana et al. [151] explored GPT-3 family models like GPT-3, ChatGPT and GPT-4 for protein-protein interaction extraction. It is reported that including normalized protein names in the prompt enhances the performance of the model. However, the fine-tuned PubMedBERT model outperforms the GPT-4 model with an F1-score of 86.47.

Yuan et al. [152] demonstrated that advanced prompting strategies like event ranking and chain-of-thought improve the performance of ChatGPT compared to vanilla prompting in temporal relation extraction. However, ChatGPT lags behind traditional neural networks like LSTMs and fine-tuned pretrained language models, which indicates the toughness of the temporal relation extraction task. Wang et al. [138] evaluated the performances of the latest LLMs like GPT-3.5, GPT-4, and Bard on entity extraction and relation classification in the clinical domain. Experiment results showed that GPT-4 with self-question prompting outperforms other LLMs on most of the datasets. Li et al. [162] compared the performances of natural language and code LLMs like GPT-3 and Codex using natural language and code-style prompts. Experiment results showed that (i) Codex outperforms the GPT-3 model and moderately sized fine-tuned models, (ii) the Codex model with either natural language or code-style prompts outperforms the GPT-3 model, and (iii) code-style prompts achieve better results for both Codex and GPT-3. A possible explanation is that Codex, being pretrained on large volumes of code, encodes structured code information, which is useful for IE tasks since they involve structured outputs. Zhang et al. [163] proposed the QA4RE framework, which frames relation extraction as a question-answering problem. In the QA4RE framework, the sentence serves as context, and the relation types serve as options from which the LLM chooses. Experiment results showed that the proposed approach improves the performance of ChatGPT and GPT-3.5 by a good margin in relation extraction.
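The following is a minimal sketch of a QA4RE-style prompt, where the sentence is the context and candidate relation types are rendered as answer options. The option phrasings and relation names are illustrative assumptions, not the exact templates of the cited work.

# A sketch of framing relation extraction as multiple-choice question answering.

def qa4re_prompt(sentence: str, head: str, tail: str, relation_types: list) -> str:
    options = [
        f"{chr(ord('A') + i)}. {head} and {tail} have the relation '{rel}'."
        for i, rel in enumerate(relation_types)
    ]
    options.append(f"{chr(ord('A') + len(relation_types))}. None of the above.")
    return (
        f"Sentence: {sentence}\n"
        f"Which option best describes the relation between '{head}' and '{tail}'?\n"
        + "\n".join(options)
        + "\nAnswer with a single option letter."
    )

if __name__ == "__main__":
    print(qa4re_prompt(
        "Marie Curie was born in Warsaw.",
        "Marie Curie", "Warsaw",
        ["place_of_birth", "place_of_death", "employer"],
    ))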

Some of the research works [155], [156], [158] demonstrated that GPT-3 family models can achieve SOTA results in information extraction tasks. Wan et al. [156] achieved SOTA results in relation extraction with the GPT-RE framework. GPT-RE overcomes the drawbacks in existing works using entity-aware demonstration retrieval based on a fine-tuned model and gold label-induced reasoning. The use of representations from the fine-tuned relation model for demonstration selection is more effective as they naturally include entity and relation information. Ma et al. [158] proposed a “filter then rerank” approach that uses both fine-tuned models and LLMs to take advantage of the strengths of both for few-shot information extraction. Here the fine-tuned model acts as a filter, while the LLM acts as a re-ranker. The proposed approach achieves SOTA results with an average improvement of over 2 points in F1 score.
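A generic sketch of kNN-based demonstration selection is shown below. The cited works use task-aware encoders (e.g., representations from a fine-tuned relation model); here the embed function is a crude placeholder that such an encoder would replace.

# kNN-style demonstration selection: rank a labelled pool by similarity to the test input.
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder embedding (bag of characters), only for illustration.
    vec = np.zeros(256)
    for ch in text.lower():
        vec[ord(ch) % 256] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-8)

def select_demonstrations(test_input: str, pool: list, k: int = 4) -> list:
    # Score each labelled example by cosine similarity to the test input, keep top-k.
    test_vec = embed(test_input)
    scored = [(float(np.dot(test_vec, embed(x["text"]))), x) for x in pool]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [x for _, x in scored[:k]]

pool = [
    {"text": "Steve Jobs founded Apple.", "label": "founder_of"},
    {"text": "Paris is the capital of France.", "label": "capital_of"},
]
print(select_demonstrations("Bill Gates founded Microsoft.", pool, k=1))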

4.3       Question Answering

Overview. Question Answering (QA) is an important natural language processing task which deals with the development of algorithms to understand and interpret user queries in natural language and then deliver accurate responses [174], [175]. The main aim of question answering systems is to enhance human-computer interaction, i.e., QA systems avoid the use of complex commands and allow the user to interact with machines in a more natural way through natural language queries. For example, popular AI assistants like Amazon Alexa, Google Assistant and Apple Siri rely on QA to provide accurate answers to user queries. The option of interaction through natural language queries extends the reach of technology to a broader audience. QA can be treated as a fine-grained version of information retrieval [176], and the demand for QA systems is increasing day by day because of their ability to generate answers which are accurate, relevant and short.

Research works exploring GLLMs for question answering tasks. The NLP research community explored GLLMs for question answering in various domains like education [177], [184], news [180], healthcare [138], [182], [183], [185], [186], [190], [191], [193], [195], social media [135], coding [187], legal [188], [194], finance [136] and scientific literature [189]. Most of the research works focused on the English language, except for a few research works focusing on languages like Portuguese [177], Japanese [191], [195] and Chinese [193]. As advanced prompting methods allow GLLMs to perform well, some of the research works investigated the effectiveness of advanced prompting strategies like chain-of-thought [138], [177], [178], [183], [189], [195], self-question prompting [138], [193] and holistically thought [193] for question answering. Table 3 presents a summary of research works exploring GLLMs for question answering across various domains and languages.

Zheng et al. [181] studied the shortcomings of ChatGPT in answering complex open-domain questions and found errors related to understanding, factual accuracy, specificity, and logical reasoning. They also analyzed the importance of knowledge memorization, recall, and reasoning abilities in addressing these failures. The authors demonstrated that providing the model with external knowledge, cues for knowledge recall, and guidance for logical reasoning can enhance its ability to provide more accurate answers. Samaan et al. [182] examined the accuracy of ChatGPT in answering questions related to bariatric surgery. The authors reported that ChatGPT correctly answered 131 of 151 questions, i.e., it achieves an accuracy of 86.8%. The impressive performance of ChatGPT shows that it can serve as an additional information resource alongside healthcare professionals and reduce their burden in answering patient questions. Holmes et al. [183] compared the performances of GLLMs like ChatGPT and GPT-4 with other LLMs like Bard and BLOOMZ and with medical physicists in answering questions related to radiation oncology physics. The performance of GPT-4 is very impressive as the model outperforms the medical physicists and the other LLMs like ChatGPT, Bard and BLOOMZ. The performance of GPT-4 is further enhanced using CoT prompting, i.e., the model is prompted to arrive at the answer after step-by-step reasoning. Nori et al. [185] performed a comprehensive evaluation of the GPT-4 model on medical question answering in zero and few-shot settings. For evaluation, the authors used six datasets: two related to the United States Medical Licensing Examination (USMLE) and four from the MultiMedQA benchmark [116]. The performance of GPT-4 is very impressive as it outperforms not only a general LLM like GPT-3.5 but also a medical domain-specific LLM like Med-PaLM [116]. Moreover, on the USMLE datasets, the GPT-4 model's score is 20 points above the passing score.


Paper | Task(s) | GLLMs Explored | Prompt Settings | Domain(s) | Language(s) | SOTA Results
[177] | Admission Exam Question Answering | GPT-3.5, ChatGPT, GPT-4 | ZS, FS | Education | Brazilian Portuguese | No
[178] | Knowledge-based Complex Question Answering | GPT-3, GPT-3.5, ChatGPT | ZS | General | Multiple languages | No
[179] | Knowledge-based Visual Question Answering | GPT-3 | ZS | General | English | Yes
[180] | Tabular Question Answering | GPT-3 | ZS, FS | News | English | No
[181] | Open Domain Question Answering | ChatGPT | ZS | General | English | No
[182] | Bariatric Surgery Question Answering | ChatGPT | ZS | Healthcare | English | No
[183] | Radiation Oncology Physics Question Answering | ChatGPT, GPT-4 | ZS | Healthcare | English | No
[184] | Computer Science Question Answering | ChatGPT | ZS | Education | English | No
[185] | Medical Question Answering | GPT-3.5, GPT-4 | ZS, FS | Healthcare | English | No
[186] | Patient-specific Question Answering | ChatGPT | ZS | Healthcare | English | No
[132] | Question Answering | ChatGPT | ZS | General | English | Yes
[157] | Boolean Question Answering | ChatGPT | ZS | General | English | No
[133] | Multiple Choice Question Answering | ChatGPT | ZS | General, Social Media | English | No
[135] | Question Answering | GPT-3, GPT-3.5, ChatGPT | ZS, FS | General | English | No
[187] | Multiple Choice Code Question Answering | GPT-3.5 | ZS | Coding | English | No
[188] | Bar Exam Question Answering | GPT-3.5 | ZS | Legal | English | No
[189] | Multi-Document Question Answering | GPT-3.5 | FS | General, Scientific Literature | English | No
[190] | Plastic Surgery Exam Question Answering | ChatGPT | ZS | Healthcare | English | No
[191] | Japanese Medical Exam Question Answering | GPT-3.5, GPT-4 | FS | Healthcare | Japanese | No
[136] | Financial Question Answering | ChatGPT, GPT-4 | ZS | Finance | English | No
[138] | Medical Question Answering | GPT-3.5, GPT-4 | ZS, FS | Healthcare | English | No
[192] | Multiple Choice Question Answering | GPT-3, Codex, InstructGPT | ZS | General | English | No
[193] | Medical Conversational Question Answering | GPT-3, InstructGPT | ZS | Healthcare | English, Chinese | No
[194] | Question Answering | GPT-3 | ZS | Multiple domains including Legal and Health | English | No
[195] | Japanese Medical Exam Question Answering | GPT-3, ChatGPT, GPT-4 | FS | Healthcare | Japanese | No

TABLE 3. Summary of research works exploring GLLMs for question answering tasks. Here ZS represents zero-shot, and FS represents few-shot.

Hamidi et al. [186] evaluated ChatGPT and Claude in answering patient-specific medical questions from MIMIC-III clinical notes. Experiment results demonstrated that the performances of both models are promising, as these models display significant levels of coherence, accuracy, coverage and relevance in their answers. Li et al. [136] demonstrated that GPT-4 achieves the best results for question answering in the finance domain and outperforms ChatGPT, domain-specific models like BloombergGPT and FinQANet, and general LLMs like OPT (66B) and BLOOM (176B). Although the performance of GLLMs is impressive in zero and few-shot settings on multiple-choice question answering, these models still lag behind SOTA results. The main reason for this is the use of cloze prompts. In cloze prompts, the model is prompted with only the question, without answer options, so the model generates the answer just by conditioning on the question. Robinson et al. [192] proposed a new prompting strategy called the multiple-choice prompt, which prompts the model with the question and answer options so that the model generates the answer by conditioning on both. Evaluation on 20 datasets showed that the multiple-choice prompt helps GLLMs achieve near-SOTA results.
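The sketch below illustrates the two prompt styles discussed above. The exact wording is an assumption made for illustration, not the templates used in the cited work.

# Cloze-style vs. multiple-choice prompts for multiple-choice question answering.

QUESTION = "Which planet is known as the Red Planet?"
OPTIONS = ["Venus", "Mars", "Jupiter", "Saturn"]

def cloze_prompt(question: str) -> str:
    # Cloze style: the model sees only the question and generates an answer,
    # which is then matched against the options (e.g., by likelihood scoring).
    return f"Question: {question}\nAnswer:"

def multiple_choice_prompt(question: str, options: list) -> str:
    # Multiple-choice style: the options are shown, so the model conditions on
    # both the question and the candidate answers before choosing a letter.
    lettered = "\n".join(f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(options))
    return f"Question: {question}\n{lettered}\nAnswer with a single option letter."

print(cloze_prompt(QUESTION))
print()
print(multiple_choice_prompt(QUESTION, OPTIONS))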

Some of the research works explored the effectiveness of GLLMs in answering exam questions from various domains. Nunes et al. [177] investigated the performances of GLLMs like GPT-3.5, ChatGPT and GPT-4 in answering questions from the Brazilian university admission exam. Here all the questions are in the Brazilian Portuguese language. The authors explored different prompting strategies like vanilla (zero-shot and few-shot) and CoT (few-shot). The authors observed that GPT-4 outperforms all other models by a large margin of over 11 points and achieves the best results with CoT prompting in few-shot settings. Joshi et al. [184] evaluated ChatGPT in answering undergraduate-level computer science exam questions. For the evaluation, the authors gathered (i) questions from various computer science subjects like data structures, operating systems, machine learning and database management systems, (ii) questions from the GATE exam and (iii) programming questions from the LeetCode website. The results showed that ChatGPT is inconsistent in answering the questions, so students are not advised to rely on ChatGPT completely for their assignments and exams. Bommarito et al. [188] examined the ability of OpenAI's text-davinci-003 (GPT-3.5) model in answering multiple-choice questions from the Bar Exam. Interestingly, human participants with extensive education and specialized training achieved a 68% accuracy rate, while the GPT-3.5 model achieved a lower accuracy rate of 50.3%. Gupta et al. [190] evaluated how effective ChatGPT is in answering questions from the plastic surgery in-service training examination. The authors reported that ChatGPT achieves an accuracy of 54.96% by correctly answering 242 questions. Tanaka et al. [191] evaluated the performances of GLLMs like GPT-3.5 and GPT-4 in answering questions from the Japanese National Medical Licensing Examination (NMLE). Here the input includes sample examples and instructions to translate the question into English and summarize it before answering. The authors reported that GPT-4 achieves a score above the minimum passing score, and further analysis showed that the incorrect answers are due to insufficient medical knowledge and insufficient information about the Japan-specific medical system. Kasai et al. [195] reported that GPT-4 outperforms other models and passes the Japanese national medical licensing exams from the last six years. Moreover, ChatGPT with English-translated prompts achieves better results than ChatGPT with Japanese prompts. This is because ChatGPT is predominantly trained on English text corpora.

Some of the research works explored GLLMs for more challenging question answering tasks like tabular question answering [180], knowledge-based complex question answering [178], multiple-choice code question answering [187], multi-document question answering [189] and conversational question answering [193]. Srivastava et al. [180] evaluated the effectiveness of GPT-3 for question answering on tabular data in zero and few-shot settings. Here the model is prompted with unstructured passage text, tabular data in JSON format, examples (in the case of few-shot) and the question. The authors reported that GPT-3 is able to successfully locate the table, comprehend its structure, and accurately access the relevant cells or passages of text in order to answer the given questions. Savelka et al. [187] evaluated the effectiveness of GPT-3.5 models in answering multiple-choice questions (MCQs), particularly those involving code snippets from programming courses. Experiment results showed that MCQs with code snippets have lower success rates compared to those without code, indicating a challenge in answering multiple-choice questions with code snippets. Pereira et al. [189] presented Visconde, a novel framework based on the GPT-3.5 model to tackle multi-document question answering. Visconde follows a three-step process involving decomposition, retrieval, and aggregation. The decomposition phase uses the GPT-3.5 model in few-shot settings for question simplification, the retrieval stage uses a SOTA retrieval model to select the relevant text chunks, and the final aggregation phase uses GPT-3.5 with few-shot CoT prompting to get the answer. The authors observed that CoT prompting, i.e., generating reasoning steps before generating the final answer, enhances the performance. Weng et al. [193] enhanced the performance of GLLMs in answering medical conversational questions in English and Chinese using a novel prompting strategy called Holistically Thought (HoT). The HoT prompting strategy combines diffused thinking and focused thinking to generate high-quality responses: diffused thinking generates various responses through diversified decoding, focused thinking generates a concise medical summary based on the dialogue, and the final response is generated based on the dialogue and the outputs of diffused and focused thinking.
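The skeleton below sketches a Visconde-style decompose-retrieve-aggregate pipeline. The functions call_llm and retrieve are hypothetical placeholders standing in for the GPT-3.5 calls and the retrieval model used in the cited work; the prompt wording is an assumption.

# Skeleton of a decompose-retrieve-aggregate pipeline for multi-document QA.
from typing import List

def call_llm(prompt: str) -> str:
    # Placeholder: in practice this would call a GLLM with few-shot examples.
    return "LLM output for: " + prompt[:60]

def retrieve(query: str, top_k: int = 3) -> List[str]:
    # Placeholder: in practice a retriever selects relevant chunks per sub-question.
    return [f"chunk-{i} relevant to '{query}'" for i in range(top_k)]

def answer_multidoc_question(question: str) -> str:
    # Step 1: decomposition - simplify the complex question into sub-questions.
    sub_questions = call_llm(f"Decompose into simpler sub-questions: {question}").split(";")
    # Step 2: retrieval - gather evidence chunks for each sub-question.
    evidence = [chunk for sq in sub_questions for chunk in retrieve(sq)]
    # Step 3: aggregation - answer with chain-of-thought over the evidence.
    context = "\n".join(evidence)
    return call_llm(
        f"Context:\n{context}\nQuestion: {question}\n"
        "Reason step by step over the context, then give the final answer."
    )

print(answer_multidoc_question("Which of the two papers was published first, and by whom?"))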

Unlike all the above discussed research works, where the performances of GLLMs are just satisfactory but not SOTA, some of the research works [132], [179] demonstrated that it is possible to achieve SOTA results for question answering tasks using GLLMs. For example, Yang et al. [179] explored the GPT-3 model for knowledge-based visual question answering, which involves answering questions that require information not available in the input images. The authors propose a novel approach which uses GPT-3 as an implicit and unstructured knowledge source. Experiment results showed that the proposed approach achieves new SOTA results by outperforming existing approaches with a large margin of over 8 points.

4.4       Machine Translation

Overview. Machine Translation (MT), an important task of natural language processing, deals with the development of models which can translate input text from a source language to a target language [209]–[211]. MT models receive the input text in the source language, understand the syntax and semantics of the input text and then generate the translation in the target language. So, a good machine translation model should possess strong natural language understanding and generation skills to generate quality translations. The main objective of MT systems is to enhance cross-lingual communication by reducing the gap between individuals from different linguistic communities.

Paper | GLLMs Explored | Prompt Settings | Domain(s) | Language(s) | Granularity | Outperforms Commercial Systems
[196] | ChatGPT | ZS | General | Japanese, Chinese | Sentence | No
[197] | ChatGPT | ZS | General, News, Healthcare | English, Chinese, German, Romanian | Sentence | No
[198] | ChatGPT, GPT-4 | ZS | General, Healthcare, Social Media | English, Chinese, German, Romanian | Sentence | Yes
[199] | InstructGPT, ChatGPT, GPT-4 | ZS, FS | News, Social Media, E-Commerce, Dialogue | English, German, Chinese | Sentence, Document | Yes
[200] | ChatGPT | ZS, FS | General, News, Social Media, Dialogue, E-Commerce | English, French, Spanish | Sentence | Yes
[201] | ChatGPT, GPT-4 | ZS | General, Social Media, News, Dialogue | English, German, Russian | Document | Yes
[202] | ChatGPT | ZS, FS | General | 102 languages in 202 directions | Sentence | No
[203] | ChatGPT | ZS | General | English, Chinese, French | Paragraph | No
[132] | ChatGPT | ZS | General | Twelve languages, including four low-resource languages | Sentence | No
[204] | GPT-3.5 | ZS | General | 18 language pairs, including Japanese, English and Polish | Sentence, Paragraph | Yes
[205] | GPT-3.5 | ZS, FS | General | English, Arabic, Chinese, German, Spanish | Sentence | Yes
[206] | GPT-3.5 | ZS, FS | General | English, Chinese, Japanese, German, French | Sentence | No
[207] | GPT-3.5, GPT-4 | ZS | General | English, German, Chinese | Sentence | Yes
[208] | GPT-3.5 | ZS | General | English, German, Russian | Sentence | Yes

TABLE 4. Summary of research works exploring GLLMs for machine translation. Here ZS represents zero-shot, and FS represents few-shot.

The evolution of MT systems started with rule-based models, followed by statistical and neural models [211]. Rule-based MT systems are built on top of manually crafted syntactic and grammatical rules. As manually framing rules is heavily laborious and expensive, these systems were later replaced by statistical MT systems. Statistical MT systems use statistical models trained on bilingual data. With the evolution of deep learning models, the research community started to build neural machine translation (NMT) systems with the help of neural models [12], [14], [212]. These neural models are essentially based on the encoder-decoder architecture, where the encoder understands the input sequence and encodes it into a vector, and the decoder, based on the encoder output, generates the output sequence auto-regressively. Some of the recent neural models used for translation are mBART-50 [213], M2M100 [214], NLLB200 [215], etc.

Research works exploring GLLMs for machine translation. In recent times, GLLMs like ChatGPT and GPT-4 have demonstrated remarkable performances in both natural language understanding and generation tasks. A good machine translation system requires strong natural language understanding and generation skills. As ChatGPT and GPT-4 possess such skills, the research community investigated the effectiveness of these models for machine translation across various domains like news [197], [199]–[201], healthcare [197], [198], social media [198]–[201], dialogue [199]–[201] and e-commerce [199], [200]. Most of the research works focused on sentence-level machine translation [132], [196]–[200], [202], [204]–[208], except for a few research works focusing on paragraph-level machine translation [203], [204] and document-level machine translation [199], [201]. As advanced prompting methods allow GLLMs to perform well, some of the research works investigated the effectiveness of advanced prompting strategies like pivot prompting [198], chain-of-thought [207] and multi-aspect prompting and selection [206]. Table 4 presents a summary of research works exploring GLLMs for machine translation across various domains and languages.

Gu et al. [196] proposed a novel approach based on ChatGPT to enhance the quality of translation from Japanese to Chinese by effectively handling attributive clauses using a pre-edit scheme. The proposed approach, which integrates the pre-edit scheme with a novel two-step prompting strategy, enhances the translation quality by more than 35%. Peng et al. [197] explored the impact of temperature, task and domain information on the translation performance of ChatGPT. The authors showed that (i) ChatGPT's performance degrades with an increase in temperature, and hence it is recommended to use a lower temperature (0 is recommended), and (ii) including task and domain information in the prompt enhances the performance of ChatGPT consistently for both high- and low-resource language translations.
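A minimal sketch of the setup suggested by these findings, low temperature plus task and domain information in the prompt, is given below. It assumes the openai Python package with its pre-1.0 ChatCompletion interface and a configured API key; the model name and prompt wording are assumptions made for illustration, not the exact setup of the cited work.

# Translation prompt with task/domain information and a low temperature.
import openai

def translate(text: str, src: str, tgt: str, domain: str = "news") -> str:
    prompt = (
        f"You are a professional {domain} translator. "            # domain information
        f"Translate the following {src} sentence into {tgt}. "     # task information
        f"Return only the translation.\n\n{text}"
    )
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,   # a lower temperature was reported to give better translations
    )
    return response["choices"][0]["message"]["content"]

# Example usage (requires openai.api_key to be set):
# print(translate("Guten Morgen, wie geht es Ihnen?", "German", "English"))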

Zhu et al. [202] evaluated the performance of ChatGPT and other LLMs like OPT, BLOOM and XGLM on 102 languages in 202 translation directions. The authors reported that ChatGPT comprehensively outperforms the other LLMs but still lags behind neural machine translation models like NLLB in the majority of the translation directions. Further analysis revealed three error types, namely hallucination, monotonic translation and off-target translation. Lyu et al. [203] presented some interesting research directions with respect to using LLMs for machine translation, including stylized machine translation, interactive machine translation and translation memory-based machine translation. Neural machine translation systems just focus on source-target text mapping, which results in a lot of errors. Unlike neural machine translation systems, the human translation process involves intermediate steps to ensure high translation quality. Inspired by the human translation process, He et al. [206] proposed MAPS, which involves three steps, namely knowledge mining, knowledge integration and knowledge selection, to generate quality translations. Extensive evaluation on the WMT22 test set shows that MAPS improves the performance of models like GPT-3.5 and Alpaca and also addresses the hallucination issue by resolving 59% of hallucination errors.

In all the above discussed research works, the performances of GLLMs are just satisfactory but not on par with or beyond the performances of commercial machine translation systems. Some of the research works [198]–[201], [204], [205], [207], [208] showed that it is possible to outperform commercial machine translation systems using GLLMs. For example, Jiao et al. [198] investigated the translation capabilities of GLLMs like ChatGPT and GPT-4 and compared their performance with commercial systems like Google Translate, DeepL Translate and Tencent TranSmart. Extensive evaluation on multiple datasets showed that (i) the performance of GLLMs is on par with commercial systems in the case of high-resource languages only, and (ii) the translation quality of low-resource languages can be enhanced using a novel pivot prompting strategy, which involves translating into a high-resource language before translating into the target low-resource language. Naive prompts are unable to fully elicit the translation ability of ChatGPT. So, Gao et al. [200] focused on developing advanced prompting strategies by including additional information like task information, domain information and syntactic information like PoS (part-of-speech) tags. The authors showed that ChatGPT with the proposed advanced prompting strategy achieves promising results and even outperforms commercial systems like Google Translate and DeepL Translate. Wang et al. [201] examined the performances of ChatGPT and GPT-4 for document-level machine translation and also compared the results with commercial systems from Google, DeepL and Tencent. The authors reported that GLLMs do well when the sentences in the document are combined and given to the model at once. Moreover, with this prompting strategy, both GLLMs exhibit better performances than commercial machine translation systems according to human evaluation and also outperform most document-level neural machine translation methods in terms of d-BLEU scores. Karpinska et al. [204] explored the GPT-3.5 model for paragraph-level machine translation. The authors experimented with three different prompting strategies, namely translating sentence by sentence in isolation, translating sentence by sentence in the presence of the rest of the paragraph, and translating the entire paragraph at once. After extensive evaluation on 18 language pairs, including English and Japanese, the authors report that translating the entire paragraph at once outperforms the other strategies and commercial systems like Google Translate. Raunak et al. [208] examined the differences between the translations generated by GLLMs like GPT-3.5 and NMT systems like Microsoft Translator. The authors reported that GLLM-generated translations are less literal, yet achieve better scores.
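The sketch below illustrates the structure of pivot prompting: translate into a high-resource pivot language first, then into the target. The call_llm function is a hypothetical placeholder for a GLLM call, and the prompt wording and example language pair are illustrative assumptions.

# Pivot prompting sketch: source -> pivot (e.g., English) -> target.

def call_llm(prompt: str) -> str:
    return "<translation of: " + prompt[-60:] + ">"   # placeholder for a GLLM call

def pivot_translate(text: str, src: str, tgt: str, pivot: str = "English") -> str:
    # Step 1: source -> pivot (a high-resource direction the model handles well).
    pivot_text = call_llm(f"Translate the following {src} text into {pivot}:\n{text}")
    # Step 2: pivot -> target, conditioning on the higher-quality pivot translation.
    return call_llm(f"Translate the following {pivot} text into {tgt}:\n{pivot_text}")

print(pivot_translate("...", src="Javanese", tgt="Sundanese"))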

4.5       Keyphrase Generation

Overview. Keyphrase generation (KPG) involves generating a set of phrases that capture the main ideas of a document [225]. The primary advantage of KPG over keyphrase extraction is the ability to generate both extractive and abstractive keyphrases. Keyphrase generation is approached as a sequence-to-sequence generation task [12], [226], [227] in the existing works. The current state-of-the-art model for keyphrase generation is KeyBART [227], which is based on BART and trained using the text-to-text generation paradigm. Table 5 presents a summary of research works exploring GLLMs for keyphrase generation.

Research works exploring GLLMs for keyphrase generation. Martinez et al. [216] performed a comprehensive evaluation of ChatGPT as a keyphrase generator by evaluating its performance on six datasets using six candidate prompts. The authors reported that the results are promising, but ChatGPT struggles to generate absent keyphrases. Song et al. [217] evaluated ChatGPT on multiple datasets from the news and scientific literature domains containing both short and long documents. Experiment results showed that ChatGPT outperforms KeyBART [227], the SOTA model, on all the datasets.

4.6       Dialogue Tasks

Overview. Dialogue tasks in natural language processing (NLP) deal with understanding and generating human-like conversations between machines and users [228]. The main objective of these tasks is to enable machines to have conversations with humans in a natural way. These dialogue tasks are essential components of building effective conversational agents, which have a wide range of applications, including customer support [228], [229].

Paper | GLLMs Explored | Prompt Settings | Domain(s) | Language(s) | SOTA Results
[216] | ChatGPT | ZS | News, Scientific Literature | English | Yes
[217] | ChatGPT | ZS | Scientific Literature | English | No

TABLE 5. Summary of research works exploring GLLMs for keyphrase generation task. Here ZS represents zero-shot, and FS represents few-shot.

Paper | Task(s) | GLLMs Explored | Prompt Settings | Domain(s) | Language(s)
[218] | Spoken Language Understanding and Dialogue State Tracking | GPT-3.5, ChatGPT | ZS | General | English
[219] | Emotion Dialogue Understanding and Generation Tasks | ChatGPT | ZS, FS | General | English
[220] | Dialogue Summarization | GPT-3 | ZS | Healthcare | English
[132] | Dialogue Generation | ChatGPT | ZS | General | English
[157] | Dialogue Summarization | ChatGPT | ZS | General | English
[221] | Dialogue Summarization | GPT-3 | FS | General | English
[222] | Dialogue Evaluation | GPT-3 | FS | General | English
[223] | Dialogue Discourse Analysis | ChatGPT | ZS, FS | General | English, Chinese
[224] | Dialogue Question Answering | ChatGPT | ZS, FS | General | English, Chinese

TABLE 6. Summary of research works exploring GLLMs for various dialogue tasks. Here ZS represents zero-shot, and FS represents few-shot.


Research works exploring GLLMs for dialogue tasks. The research community explored GLLMs like GPT-3, GPT-3.5 and ChatGPT for various dialogue tasks like dialogue summarization [157], [220], [221], dialogue question answering [224], emotion dialogue understanding and generation [219], dialogue state tracking [218], dialogue generation [132], and dialogue discourse analysis [223]. Some of the research works explored LLMs for the evaluation of dialogue tasks [222]. Most of the research works focused on general domain and English language datasets, except for a few research works which focused on the medical domain [220] and languages like Chinese [223], [224]. Table 6 presents a summary of research works exploring GLLMs for various dialogue tasks.

Pan et al. [218] reported that ChatGPT exhibits better performance in dialogue state tracking compared to spoken language understanding. Further, the authors showed that the performance of ChatGPT can be enhanced by (i) using a multi-turn interactive prompt for dialogue state tracking and (ii) providing additional details like slot names, examples and descriptions for slot filling in spoken language understanding. Zhao et al. [219] explored the emotion dialogue capabilities of ChatGPT by evaluating the model on five different tasks, namely emotion recognition, emotion cause recognition, dialogue act classification (emotion dialogue understanding), empathetic response generation and emotion support generation. It is reported that ChatGPT exhibits better performances in emotion dialogue generation compared to emotion dialogue understanding. Chintagunta et al. [220] showed that an in-house model trained on GPT-3-generated summaries achieves performance comparable to one trained on human-generated summaries. Further, the in-house model trained on mixed summaries (human-generated and GPT-3-generated) achieves better performance than those trained on either type of summary alone.

Prodan et al. [221] proposed a scoring system to choose the best examples for dialogue summarization using few-shot GPT-3. The proposed scoring system enhances the quality of generated summaries, with an 11% reduction in failures. Huynh et al. [222] studied the impact of various aspects influencing the performance of LLMs as dialogue evaluators. The authors reported that the performance as a dialogue evaluator largely depends on the diversity and relevance of the datasets used for instruction tuning. Fan et al. [223] investigated the effectiveness of ChatGPT for dialogue discourse analysis by evaluating its performance on three tasks, namely topic segmentation, discourse parsing and discourse relation recognition. ChatGPT's performance is promising in the case of topic segmentation, and CoT prompting enhances the performance. Wang et al. [224] proposed a novel approach based on explicit CoT prompting and demonstration selection to answer dialogue questions in few-shot settings.

Paper | Task(s) | GLLMs Explored | Prompt Settings | Domain(s) | Language(s) | SOTA Results
[230] | Passage Re-ranking | GPT-3, GPT-3.5, ChatGPT, GPT-4 | ZS, FS | General, News, Healthcare, Scientific Literature | English, Ten Low-Resource Languages | Yes
[231] | Document Retrieval | GPT-3.5 | ZS, FS | General | English | Yes

TABLE 7. Summary of research works exploring GLLMs for information retrieval tasks. Here ZS represents zero-shot, and FS represents few-shot.

4.7       Information Retrieval

Information retrieval (IR) involves accessing and retrieving relevant information from large volumes of data. Here, the main objective is to provide users with the most relevant information by matching their queries to the content of documents and ranking them based on relevance [232]. The process includes indexing, query formulation, search and retrieval, ranking, and presentation. Information retrieval is utilized in a wide range of fields, such as web search engines, digital libraries, e-commerce, healthcare, and scientific research [232]. It plays a vital role in facilitating efficient and effective access to information in the modern digital era. Table 7 presents a summary of research works exploring GLLMs for information retrieval.

Sun et al. [230] explored the effectiveness of GPT-3 family models like GPT-3, GPT-3.5, ChatGPT and GPT-4 for passage re-ranking in information retrieval. The results are promising as GPT-4 outperforms SOTA models like monoT5-3B [233] on multiple benchmarks. Moreover, a compact model trained on ChatGPT-generated data demonstrates superior performance compared to the monoT5-3B model when evaluated on the MS MARCO dataset in the BEIR [234] benchmark. The existing approaches for document retrieval employ dual dense encoders, which encode the query and document independently, resulting in shallow interaction between them [235]. To overcome this drawback, Ziems et al. [231] proposed a novel approach which performs document retrieval by generating URLs using LLMs. The authors reported that document retrieval by generating URLs outperforms existing approaches.
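An illustrative sketch of a listwise passage re-ranking prompt is shown below. The instruction format is an assumption made for illustration, not the exact template of the cited work.

# Listwise re-ranking prompt: ask the model to order candidate passages by relevance.

def rerank_prompt(query: str, passages: list) -> str:
    numbered = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        f"Query: {query}\n"
        f"Passages:\n{numbered}\n"
        "Rank the passages from most to least relevant to the query. "
        "Output only the passage numbers, e.g. [2] > [1] > [3]."
    )

print(rerank_prompt(
    "side effects of aspirin",
    ["Aspirin can cause stomach irritation.",
     "Aspirin was first synthesized in 1897.",
     "Common side effects include heartburn and nausea."],
))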

4.8       Recommendation Systems

Overview. Recommendation systems aim to reduce information overload and enhance the user experience by making relevant recommendations related to products or content based on user preferences and behaviour [236]. In recent times, recommendation systems have gained immense popularity and are extensively utilized across a range of fields, such as entertainment, e-commerce, social media, etc. For example, popular platforms like YouTube and Netflix use recommendation systems to suggest relevant videos, and platforms like Amazon use recommendation systems to suggest relevant products to the user [237]. The commonly used approaches for recommendation systems are collaborative filtering [238], content-based [239] and knowledge-based [240] approaches. The performance of traditional recommendation systems is limited by a number of issues like the cold-start problem, poor generalization across domains and lack of explainability [241], [242].

To overcome these drawbacks in traditional recommendation systems, recent works explored GPT-3 family large language models for various tasks in recommendation systems like next item prediction [243], rating prediction [241], [244], top-k predictions [241], direct recommendation [245], sequence recommendation [245] and generating explanations [245]. The evaluation is done in a variety of domains like movies [241], [243], [246]–[249], news [246], books [244], [246], [247], music [246], [248], social media [250], beauty [245], and games [249]. Table 8 presents a summary of research works exploring GLLMs for recommendation systems.

Research works exploring GLLMs for recommendation systems. Wang et al. [243] proposed a novel prompting strategy called “Next-Item Recommendation (NIR)” to recommend movies using GLLMs. The proposed prompting strategy involves a three-step process to capture the user's preferences, choose representative movies they have watched in the past, and provide a ranked list of ten recommended movies.
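The sketch below illustrates this three-step prompt structure. The prompt wording is an illustrative assumption rather than the exact template of the cited work, and call_llm is a hypothetical placeholder for a GLLM call.

# Three-step NIR-style prompting for movie recommendation.

def call_llm(prompt: str) -> str:
    return "<LLM output>"   # placeholder for a GLLM call

def nir_recommend(watched_movies: list) -> str:
    history = ", ".join(watched_movies)
    # Step 1: summarize the user's preferences from the watch history.
    prefs = call_llm(f"The user has watched: {history}. Summarize the user's preferences.")
    # Step 2: pick the most representative movies the user has watched.
    reps = call_llm(f"Given the preferences '{prefs}', select the five most "
                    f"representative movies from: {history}.")
    # Step 3: produce a ranked list of ten new recommendations.
    return call_llm(f"Based on the preferences '{prefs}' and representative movies "
                    f"'{reps}', recommend ten movies the user has not watched, "
                    "ranked from most to least relevant.")

print(nir_recommend(["Inception", "Interstellar", "The Matrix"]))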

Dai et al. [246] reported that ChatGPT outperforms other GLLMs and is more effective with pair-wise and list-wise ranking compared to point-wise ranking. When it comes to balancing cost and performance, ChatGPT with list-wise ranking outperforms both point-wise and pair-wise ranking approaches. ChatGPT demonstrates the potential for providing explanations for recommendations and addressing the challenges of the cold-start problem. Gao et al. [241] proposed Chat-REC, which leverages GLLMs to build conversational recommendation systems. The authors reported that Chat-REC performs well in tasks like top-k recommendations and zero-shot rating prediction. Moreover, Chat-REC enhances conversational recommendation systems by making them more interactive and providing clear explanations.

Mysore et al. [250] explored GLLMs like InstructGPT to generate synthetic data, and the experiment results showed that narrative-driven recommendation models trained on the augmented datasets outperform LLM baselines and other approaches. Kang et al. [247] evaluated GLLMs like GPT-3.5 and ChatGPT on user rating prediction in zero and few-shot settings. Based on the experimental findings on datasets from the movie and book domains, the authors reported that traditional models that have access to user interaction data perform better than GLLMs.

Paper | GLLMs Explored | Prompt Settings | Domain(s) | Language(s) | SOTA Results
[243] | GPT-3.5 | ZS | Movies | English | No
[246] | GPT-3.5, ChatGPT | ZS, FS | News, Books, Movies, Music | English | No
[241] | GPT-3.5, ChatGPT | ZS | Movies | English | No
[250] | InstructGPT | FS | Social Media | English | No
[247] | GPT-3.5, ChatGPT | ZS, FS | Movies, Books | English | No
[248] | ChatGPT | ZS | Music, Movies | English | No
[245] | ChatGPT | ZS, FS | Beauty | English | Yes
[249] | ChatGPT | ZS | Movies, Games | English | No
[244] | ChatGPT | ZS, FS | Books | English | No

TABLE 8. Summary of research works exploring GLLMs for recommendation systems. Here ZS represents zero-shot, and FS represents few-shot.

Zhang et al. [248] introduced FaiRLLM, a new benchmark with eight sensitive attributes from domains like movies and music, to investigate the fairness of GLLM recommendations. The authors reported that GLLM-based recommendation systems are not fair with respect to certain sensitive attributes.

Liu et al. [245] evaluated the performance of ChatGPT on five recommendation tasks, which include predicting ratings, direct recommendation, sequence recommendation, generating explanations, and summarizing reviews. Based on the evaluation on the Amazon Beauty dataset, the authors reported that (i) ChatGPT is much better at rating prediction compared to other tasks like direct and sequence recommendation, and (ii) ChatGPT achieves new SOTA results in generating explanations based on human evaluation. Hou et al. [249] demonstrated that GLLMs possess strong potential for zero-shot ranking tasks, showcasing performance that is comparable to or even superior to traditional recommendation models. Here, the authors designed the prompts in a way that important information like candidate items, sequential interaction history and ranking instructions is included. Zhiyuli et al. [244] proposed BookGPT, a novel framework which leverages GLLMs like ChatGPT for book recommendation. Specifically, the performance of BookGPT is evaluated on three sub-tasks, namely the book rating task, the book summary recommendation task and the user rating recommendation task. The performance of BookGPT is promising in all three sub-tasks, and the performance increases with an increase in prompt examples.

4.9       Coding Tasks

Overview. Software engineering is a discipline which deals with designing, developing, testing, and maintaining software systems [272]. To create software systems, software engineers use a variety of programming languages, development tools, and technologies. To aid software engineers and enhance their productivity, the research community focused on automating a number of coding tasks like code generation from natural language descriptions, code repair, code explanation generation, code hint generation, code completion, code documentation generation, test case generation, code vulnerability detection, code refactoring, etc. The evolution of pretrained source code models has paved the way for achieving cutting-edge results across coding tasks [455]. Some of the popular pretrained source code models are CodeBERT [86], CodeGPT [273], CoTexT [274], GraphCodeBERT [275], CodeT5 [87], CodeT5+ [88], PLBART [276], PyCodeGPT [277], etc. Inspired by the success of GLLMs in NLP tasks, the research community focused on assessing the performances of these models on coding tasks as well.

Research works exploring GLLMs for various coding tasks. The research community explored GLLMs for coding tasks across various languages like Java [251], [252], [255], [260], [263], [264], [266], [267], [269], [270], Python [253], [254], [256]–[258], [260], [262], [263], [265], [267], [268], [271], PHP [260], GO [260], Ruby [260], JavaScript [260], C [261], [268], C++ [259], [268], Julia [268], and MATLAB [268]. Most of the research works focused on the Python and Java languages, while a few research works focused on other languages like PHP, GO, Ruby, JavaScript, C, C++, Julia and MATLAB. The assessment is done in zero and few-shot settings using mostly direct prompts. Table 9 presents a summary of research works exploring GLLMs for various coding tasks.

Some of the research works [253], [257], [259], [268], [269] explored GLLMs for the code generation task. Yetistiren et al. [253] compared various AI-assisted code generation tools like ChatGPT, Amazon's CodeWhisperer and GitHub's Copilot on the HumanEval [103] dataset. ChatGPT outperforms the other tools by generating correct code 65.2% of the time, while the other tools generate correct code at most 46.3% of the time. The test cases in existing datasets for code generation evaluation are limited in terms of quality and quantity. So, Liu et al. [257] proposed EvalPlus, a new framework for automatic test case generation using ChatGPT and a traditional mutation approach. The authors use EvalPlus to develop HumanEvalPlus on top of the HumanEval [103] dataset.

Paper | GLLMs Explored | Task(s) | Prompt Settings | Language(s) | SOTA Results
[251] | ChatGPT | Code Repair | ZS, FS | Java | Yes
[252] | GPT-3, ChatGPT | Code Vulnerability Detection | ZS | Java | No
[253] | ChatGPT | Code Generation | ZS | Python | No
[254] | ChatGPT | Finding Failure-Inducing Test Cases | ZS | Python | Yes
[255] | ChatGPT | Code Generation | ZS | Java, C# | No
[256] | GPT-4 | Code Generation, Code Refactoring, Test Case Generation | ZS | Python | No
[257] | ChatGPT, GPT-4 | Code Generation | ZS | Python | No
[258] | ChatGPT | Code Explanation Generation | ZS | Python | No
[259] | ChatGPT | Code Generation | ZS | C++ | No
[260] | Codex | Code Documentation Generation | ZS, FS | Java, Python, PHP, GO, Ruby, JS | Yes
[261] | GPT-3 | Code Explanation Generation | ZS | C | No
[262] | ChatGPT | Code Generation | ZS | Python | Yes
[263] | Codex | Automatic Code Repair | ZS, FS | Python, Java | No
[264] | Codex, ChatGPT | Unit Test Generation | ZS | Java | No
[265] | ChatGPT | Code Generation, APR, Code Explanation Generation | ZS | Python | No
[266] | Codex | Code Documentation Generation | ZS, FS | Java | Yes
[267] | Codex, ChatGPT | Automatic Program Repair | ZS | Python, Java | No
[268] | ChatGPT | Code Generation | ZS | C, C++, Python, Julia, MATLAB | No
[269] | GPT-3.5 | Code Generation | ZS | Java | No
[270] | ChatGPT | Unit Test Generation | ZS | Java | No
[271] | ChatGPT, GPT-4 | Code Repair, Code Completion, Code Explanation Generation, Coding Hints Generation | ZS | Python | No

TABLE 9. Summary of research works exploring GLLMs for various coding tasks. Here ZS represents zero-shot, and FS represents few-shot.

The authors reported that HumanEvalPlus can detect a lot of incorrectly generated code that was previously undetected. Nascimento et al. [259] compared the quality of code generated by ChatGPT and software developers for competitive coding problems on the LeetCode platform using various evaluation metrics. The authors reported that ChatGPT exhibits better performance compared to novice programmers but is outperformed by experienced programmers. Kashefi et al. [268] explored how effective ChatGPT is at generating code for numerical methods in five different programming languages: C, C++, Python, MATLAB and Julia. The authors observed that the results are promising but have some limitations which require further investigation. Destefanis et al. [269] assessed the code generation ability of LLMs like Bard and GPT-3.5 by evaluating their performances in generating Java code from natural language descriptions. The authors observed that GPT-3.5 outperforms the Bard model by a large margin of more than 37%.

Some of the research works [263], [265], [267], [271] explored GLLMs for the code repair task. Prenner et al. [263] explored the Codex model for automatic program repair in the Python and Java programming languages. The authors observed that the performance of Codex is comparable to state-of-the-art methods. Moreover, the Codex model is slightly better at fixing errors in Python than in Java. Kang et al. [267] developed AutoSD, a novel framework for automatic program repair using GLLMs. The authors reported that the evaluation on three standard datasets showed that the proposed framework is on par with the baselines.

Unit tests generated using traditional approaches suffer from low readability [270]. To address this drawback, some of the research works [264], [270] explored GLLMs for test case generation. Siddiq et al. [264] evaluated models like Codex and ChatGPT for unit test generation for Java code. Experiment results showed that Codex performs better, with 80% coverage on the HumanEval dataset. However, both models perform poorly in the case of the SF110 benchmark, with less than 2% coverage. Yuan et al. [270] designed a ChatGPT-based unit test generation framework called “Chat-Tester”, whose iterative test refiner helps it generate better unit tests compared to vanilla ChatGPT.

In all the above discussed research works, the performance of GLLMs in various coding tasks is promising but still lags behind SOTA results. Some of the research works [251], [254], [260], [262], [266] demonstrated that GLLMs can achieve SOTA results in coding tasks. Xia et al. [251] proposed ChatRepair, an automatic program repair tool based on ChatGPT. ChatRepair achieves remarkable performance, surpassing all the existing methods. It successfully resolves 114 and 48 bugs on Defects4j 1.2 and 2.0 [278], respectively, outperforming the previous best by 15 and 17 bugs, respectively.
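The following is a heavily simplified sketch of the kind of conversational repair loop that tools in this line of work build on: the buggy function and the failing test output go into the prompt, and each unsuccessful patch attempt is fed back for another try. It is not the actual ChatRepair implementation; call_llm and run_tests are hypothetical placeholders.

# Simplified conversational program-repair loop with test feedback.

def call_llm(prompt: str) -> str:
    return "def add(a, b):\n    return a + b"   # placeholder candidate patch

def run_tests(candidate_code: str) -> str:
    return ""   # placeholder: empty string means all tests passed

def repair(buggy_code: str, failing_test_output: str, max_rounds: int = 3) -> str:
    prompt = (
        f"The following function is buggy:\n{buggy_code}\n"
        f"It fails with:\n{failing_test_output}\n"
        "Provide a fixed version of the function."
    )
    for _ in range(max_rounds):
        patch = call_llm(prompt)
        errors = run_tests(patch)
        if not errors:
            return patch                       # patch passes the test suite
        # Feed the new failure back into the conversation and retry.
        prompt += f"\nYour previous fix still fails with:\n{errors}\nTry again."
    return patch

print(repair("def add(a, b):\n    return a - b", "AssertionError: add(1, 1) != 2"))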

Khan et al. [260] explored Codex, a GPT-3 family model pretrained on natural and programming languages, to automate code documentation generation. The evaluation results on six programming languages showed that Codex, with just one example, outperforms existing approaches by a large margin of 11.2%. Geng et al. [266] explored Codex for code documentation generation and demonstrated that few-shot in-context learning with systematic demonstration selection helps the model achieve new SOTA results on two standard Java datasets. Some of the research works [254], [255], [262] explored advanced prompting like CoT, brainstorming, differential prompting, etc., for coding tasks. Liu et al. [255] evaluated the code generation capabilities of ChatGPT by evaluating its performances on text-to-code and code-to-code generation tasks on CodeXGLUE [273] datasets. The authors observed that advanced prompting strategies like CoT enhance the code generation capabilities of models like ChatGPT. Li et al. [262] proposed Brainstorm, a new framework for code generation. Brainstorm involves three steps: brainstorming to generate diverse thoughts, thought selection to select the best thought using a ranking model, and code writing to generate the code based on the problem statement and the best thought. The authors reported that the proposed framework helps ChatGPT increase its performance by more than 50% and achieve new SOTA results on the CodeContests [104] benchmark. Li et al. [254] showed that directly using ChatGPT to find failure-inducing test cases results in poor performance. So, the authors proposed a new prompting strategy called “Differential Prompting”, which enables ChatGPT to achieve new SOTA results on the QuixBugs dataset [279]. Differential Prompting involves program intention inference followed by two more steps: program generation and differential testing.

4.10       Multimodal AI Tasks

Overview. Traditional AI systems are designed to handle data from a single modality, such as text, image, audio or video. As real-world data is often multi-modal, researchers focused on developing multi-modal AI systems which can leverage input data from multiple modalities to generate more accurate results. Multi-modal AI systems leverage techniques from different areas of AI, like natural language processing, computer vision and speech processing, to process multi-modal input data effectively [280], [281]. Multi-modal AI systems can perform a variety of understanding and generation tasks like visual question answering [179], [282]–[284], text-to-image generation [285]–[287], text-to-video generation [288], text-to-speech synthesis [289], speech-to-text conversion [289], image captioning [290], etc.

Research works exploring GLLMs for multimodal AI tasks. After the huge success of LLMs in natural language generation and understanding tasks, the research community recently explored GPT-3 family models for multi-modal understanding and generation tasks in various combinations like image + language [179], [282]–[287], [290]–[298], video + language [288], [299], and audio + language [300], [301]. Most of the research works focused on general domain datasets, while some of the research works focused on specific domains like healthcare [290], [298]. Table 10 presents a brief summary of research works exploring GLLMs for various multimodal AI tasks.

Some of the research works developed multi-modal AI systems for a specific task like action generation [291], knowledge-based visual question answering [179], [282]–[284], X-ray report generation [290], named entity recognition [294], text-to-video generation [288], layout generation [296] and text-to-image generation [287]. Kalakonda et al. [291] proposed Action-GPT, a GPT-3 based plug-and-play framework for text-based action generation. Here, the authors generated multiple detailed body movement descriptions from the action phrases and then used them to generate actions. Shao et al. [282] proposed Prophet, which avoids using an external knowledge base by using GPT-3 as an implicit knowledge base and uses a vanilla visual question answering model to provide answer heuristics to GPT-3. The answer heuristics, along with the caption and question information, provide rich task-specific information to the GPT-3 model, which results in much better performances. Ranjit et al. [290] proposed automatic X-ray report generation based on a contrastively pretrained vision-language encoder and GPT-3 family models like GPT-3.5, ChatGPT and GPT-4. The contrastively pretrained encoder is used to encode the input X-ray image into an image embedding, based on which the most similar sentences from the radiology report corpus are retrieved. The retrieved similar sentences form the context and allow the LLM to generate a quality X-ray report. Li et al. [294] proposed PGIM, a two-stage approach which utilizes ChatGPT as an implicit knowledge base for the multi-modal NER task. In the first stage, ChatGPT, when prompted with text descriptions of the image, generates the auxiliary knowledge. In the second stage, the downstream model receives the raw text and the ChatGPT-generated auxiliary knowledge as input. The authors reported that the proposed approach outperforms existing SOTA approaches based on text-text and text-image paradigms.
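The sketch below illustrates a Prophet-style prompt for knowledge-based VQA: the image is represented by a caption, and answer heuristics (candidate answers with confidences from a vanilla VQA model) are appended so that GPT-3 can act as an implicit knowledge base. The field wording and example values are assumptions made for illustration.

# Prompt assembling caption, question and answer heuristics for knowledge-based VQA.

def prophet_style_prompt(caption: str, question: str, answer_heuristics: list) -> str:
    candidates = ", ".join(f"{ans} ({conf:.2f})" for ans, conf in answer_heuristics)
    return (
        "Answer the question about the image using your own knowledge.\n"
        f"Image caption: {caption}\n"
        f"Question: {question}\n"
        f"Candidate answers from a VQA model (with confidence): {candidates}\n"
        "Answer:"
    )

print(prophet_style_prompt(
    caption="A red double-decker bus on a city street.",
    question="In which country are buses like this most commonly found?",
    answer_heuristics=[("england", 0.62), ("london", 0.21), ("france", 0.05)],
))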

Hong et al. [288] proposed DirecT2V for text-to-video generation, which leverages the GPT-4 model as a frame-level director. Here, the GPT-4 model generates descriptions for each frame based on a single prompt, and then a text-to-image model is used to generate frames based on these descriptions.

Paper | GLLMs Explored | Task(s) | Prompt Settings | Multimodality | Domain
[291] | GPT-3 | Text-based Action Generation | ZS | Image + Language | General
[292] | ChatGPT | Twenty-Two Vision Language Tasks | ZS | Image + Language | General
[282] | GPT-3 | Knowledge-based Visual Question Answering | FS | Image + Language | General
[300] | ChatGPT | Audio Labelling | ZS | Audio + Language | General
[293] | ChatGPT | Multi-Image Reasoning, Multi-hop Document Understanding, Open-World Concept Understanding, Video Summarization | ZS | Image + Language | General
[290] | GPT-3.5, ChatGPT, GPT-4 | Chest X-Ray Report Generation | ZS | Image + Language | Healthcare
[283] | GPT-3 | Knowledge-based Visual Question Answering | FS | Image + Language | General
[299] | GPT-3.5 | Five Video Understanding Tasks | ZS | Video + Language | General
[301] | GPT-4 | Generate Instructions | ZS | Audio + Language | General
[285] | GPT-3.5, GPT-4 | Evaluator for Text-to-Image Generation | ZS | Image + Language | General
[286] | GPT-3, GPT-3.5 | Editing in Text-to-Image Generation | FS | Image + Language | General
[294] | ChatGPT | Multimodal Named Entity Recognition | FS | Image + Language | General
[295] | GPT-3 | Five Vision Language Tasks (four classification tasks and one question answering task) | FS | Image + Language | General
[288] | GPT-4 | Text-to-Video Generation | ZS | Video + Language | General
[179] | GPT-3 | Knowledge-based Visual Question Answering | FS | Image + Language | General
[296] | GPT-3.5, ChatGPT, GPT-4 | Layout Generation | FS | Image + Language | General
[302] | ChatGPT, GPT-4 | Multimodal tasks covering text, video, audio and images | ZS | Multimodal covering text, video, audio and images | General
[287] | GPT-3.5, ChatGPT, GPT-4 | Controlled Text-to-Image Generation | ZS | Image + Language | General
[297] | ChatGPT | Paraphrasing | ZS | Image + Language | General
[289] | ChatGPT | Audio Understanding and Generation Tasks | ZS | Multimodal covering text, audio and images | General
[298] | GPT-4 | Generate Instruction Tuning Dataset | FS | Image + Language | Healthcare
[284] | GPT-3 | Knowledge-based Visual Question Answering | FS | Image + Language | General

TABLE 10. Summary of research works exploring GLLMs for various multimodal AI tasks. Here ZS represents zero-shot, and FS represents few-shot.

Feng et al. [296] developed LayoutGPT, which leverages LLMs and layout-to-image models to generate 2D and 3D planning layouts from text descriptions. Zhang et al. [287] proposed ControlGPT based on LLMs and diffusion models for controllable text-to-image generation. Here, GPT-4 generates sketches in the form of TikZ code from the text instructions, and then a diffusion model generates realistic images with the generated sketches and the text instructions as input. The generated sketches help the diffusion model to better capture spatial relationships.
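
The ControlGPT-style pipeline can be sketched as three placeholder stages; none of the functions below are the real implementation, and the TikZ snippet and diffusion call are stand-ins for GPT-4 and a sketch-conditioned diffusion model.

```python
# Sketch of a ControlGPT-style pipeline: an LLM drafts a TikZ sketch that encodes
# the spatial layout, and a sketch-conditioned diffusion model produces the image.
# All functions here are placeholders, not a real implementation.

def gpt4_generate_tikz(instruction: str) -> str:
    # Placeholder for a GPT-4 call that returns TikZ code describing the object layout.
    return r"\begin{tikzpicture}\draw (0,0) rectangle (2,1);\draw (3,0) circle (0.5);\end{tikzpicture}"

def render_sketch(tikz_code: str) -> bytes:
    # Placeholder: a real system would compile the TikZ code into a sketch image.
    return tikz_code.encode("utf-8")

def diffusion_generate(instruction: str, sketch_image: bytes) -> bytes:
    # Placeholder for a diffusion model conditioned on both the sketch and the text.
    return b"<image bytes>"

def controlled_text_to_image(instruction: str) -> bytes:
    tikz = gpt4_generate_tikz(instruction)
    sketch = render_sketch(tikz)
    return diffusion_generate(instruction, sketch)

if __name__ == "__main__":
    print(controlled_text_to_image("A red cube to the left of a blue sphere"))
```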

Some of the research works focused on developing multi-modal AI systems which can handle multiple tasks [289], [292], [293], [295], [299], [302]. As ChatGPT is trained on a single data modality, i.e., text, it can only handle text inputs, and training models from scratch for vision-language tasks is not a feasible option as it involves huge computation. So, Wu et al. [292] developed Visual ChatGPT based on ChatGPT and various visual foundation models to handle 22 vision-language tasks. Bhattacharya et al. [299] proposed a novel three-stage approach to handle five video understanding tasks. The proposed approach involves transforming a video into text stories and then using this text content for video understanding tasks. Hakimov et al. [295] explored the GPT-3 model for five vision-language tasks, including four classification tasks and one question answering task. Here, the model is prompted with a text description of the input image along with other elements like the task instruction and similar examples. Huang et al. [289] proposed AudioGPT, which allows ChatGPT to handle multiple audio understanding and generation tasks with the help of audio foundation models.

Some of the research works explored GPT-3 family models for other tasks like data labelling [300], generating instructions [301], data generation [297], prompt editing [286] and evaluation [285] while developing multimodal AI systems. Mei et al. [300] used ChatGPT to rewrite noisy audio captions and developed WavCaps, an audio captioning dataset of 400k instances. The authors reported that the models trained on the WavCaps dataset achieve new SOTA results. Zhang et al. [301] developed SpeechGPT and then performed cross-modal instruction tuning to enhance its multi-modal instruction-following ability. Here, the authors use GPT-4 to generate the instructions for diverse tasks. Fan et al. [297] proposed LaCLIP (Language augmented Contrastive Language-Image Pretraining), an extended version of CLIP which applies data augmentation to both text and image data to ensure that the model gets exposed to diversified texts during training. Here, the data augmentation is performed using the open-source LLaMA model in few-shot settings, and the examples for LLaMA ICL are generated using ChatGPT. Zhu et al. [286] explored GPT-3 and GPT-3.5 models for prompt editing in text-to-image generation. The authors observed a potential reduction of 20-30% in the remaining edits required by implementing the prompt edits suggested by GPT-3 family models. Lu et al. [285] proposed LLMScore, a new metric which can effectively capture both image-level and object-level compositionality for text-to-image generation evaluation.

4.11       Machine Learning Tasks

Overview. Machine learning (ML) is an area of artificial intelligence (AI) that deals with the development of algorithms that can learn from data and make decisions [305]. Even though machine learning algorithms are successfully used in various real-world applications, creating an effective ML solution for a new task can be difficult due to the numerous design choices involved. In recent times, AutoML has evolved as a solution to reduce the human effort involved in designing ML solutions [307]. However, AutoML algorithms suffer from various drawbacks [305], like (i) the requirement of multiple rounds of trial-and-error, resulting in significant time consumption, (ii) starting the search for a new task from scratch, ignoring past experience gained from the previous tasks and (iii) many AutoML methods lack interpretability because of their black-box nature.

Research works exploring GLLMs to automate machine learning tasks. Inspired by the success of GLLMs in other tasks, the research community explored GLLMs as an alternative to AutoML to automate machine learning tasks [303]–[306]. Table 11 presents a summary of research works exploring GLLMs to automate machine learning tasks. Zheng et al. [303] explored how effective GPT-4 is for neural architecture search, i.e., designing optimal neural network configurations. The proposed approach involves two steps: (i) GPT-4 generates a candidate neural architecture based on the given problem statement, and (ii) the generated configuration is evaluated, and for further refinement, the evaluation results along with the problem statement are passed back to the model. This two-step process is repeated for a certain number of iterations to reach an optimal configuration. Shen et al. [304] proposed HuggingGPT to solve AI tasks with the help of GLLMs like ChatGPT and models in AI communities like Hugging Face. HuggingGPT involves four steps, namely task planning, model selection, task execution and response generation. The authors reported that HuggingGPT achieves promising results in solving AI tasks in language, vision and speech.
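
The iterative propose-evaluate-refine loop described above for GPT-4-driven neural architecture search might look roughly as follows; the JSON configuration schema, the prompt wording and the random placeholder functions are assumptions for illustration only.

```python
# Sketch of an iterative LLM-driven neural architecture search loop:
# propose a configuration, evaluate it, and feed the result back for refinement.
import json
import random

def call_gpt4(prompt: str) -> str:
    # Placeholder for a GPT-4 call; returns a random configuration so the loop runs offline.
    return json.dumps({"layers": random.randint(2, 8), "width": random.choice([64, 128, 256])})

def evaluate(config: dict) -> float:
    # Placeholder for actually training/validating the proposed architecture.
    return round(random.uniform(0.70, 0.95), 3)

def search(problem: str, iterations: int = 3) -> dict:
    history, best = [], None
    for _ in range(iterations):
        prompt = (f"Task: {problem}\nPrevious trials: {history}\n"
                  "Propose an improved architecture as JSON with keys 'layers' and 'width'.")
        config = json.loads(call_gpt4(prompt))
        score = evaluate(config)
        history.append({"config": config, "accuracy": score})
        if best is None or score > best["accuracy"]:
            best = {"config": config, "accuracy": score}
    return best

if __name__ == "__main__":
    print(search("image classification on CIFAR-10"))
```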

Zhang et al. [305] proposed MLCopilot, which leverages the power of GLLMs to solve machine learning tasks. MLCopilot works in two stages, namely offline and online. The offline stage involves creating an experience pool from past tasks, from which a GLLM is used to elicit relevant knowledge. The online stage involves retrieving relevant examples from the experience pool, after which the GLLM generates results based on the task description, the relevant examples and the knowledge. Zhang et al. [306] proposed AutoML-GPT, which leverages the advanced GPT-4 model to automate machine learning tasks and reduce the human effort involved in building machine learning models. AutoML-GPT involves two stages. The first stage involves composing a prompt paragraph based on the model and data cards. The second stage involves performing the four crucial steps from data processing to training log prediction.

4.12        Planning

Overview. Many important industries, like finance and banking, often involve repetitive sequential tasks. These workflows, despite their significance, are typically not fully automated or formally defined. Recently, owing to their strong reasoning capabilities, GLLMs have been explored for planning. Some of the research works [309], [311] directly used LLMs for planning, while others [308], [310] explored LLMs for plan extraction, whose outputs can then be used by automated systems.

Research works exploring GLLMs for planning. Table 12 presents a summary of research works exploring GLLMs for planning. Human models are crucial in facilitating human-robot interaction (HRI), as they empower robots to plan their behaviour based on the impact of their actions on individuals. As it is difficult to craft good human models by hand, Zhang et al. [309] used the GPT-3.5 model (i) as a zero-shot human model and (ii) for planning in trust-related scenarios. Hu et al. [311] proposed a novel prompting strategy called “Chain of Symbol” (CoS) prompting to better elicit the planning abilities of large language models like InstructGPT and ChatGPT. Unlike CoT prompting, which uses natural language descriptions to represent complex environments, CoS prompting uses condensed symbols to represent them in the intermediate reasoning steps. The authors reported that CoS prompting outperforms CoT prompting in both performance and efficiency.
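
To illustrate the contrast between CoT and CoS prompting, the sketch below builds both kinds of demonstrations for a toy block-stacking task; the condensed symbol vocabulary (e.g., "/" for "on top of") is an illustrative assumption, not the exact notation of [311].

```python
# Contrast a CoT-style demonstration (verbose natural language) with a CoS-style
# demonstration (condensed symbols) for a simple spatial planning query.

cot_demo = (
    "Q: Block A is on top of Block B, and Block B is on top of Block C. "
    "How do I get Block C?\n"
    "A: Block A is on top of Block B. Block B is on top of Block C. "
    "Remove Block A from Block B, then remove Block B from Block C, then take Block C."
)

cos_demo = (
    "Q: A/B, B/C. How do I get C?\n"
    "A: A/B/C -> remove A -> B/C -> remove B -> C."
)

def build_prompt(demo: str, query: str) -> str:
    # Prepend one demonstration to the new query, as in standard few-shot prompting.
    return f"{demo}\n\nQ: {query}\nA:"

if __name__ == "__main__":
    print(build_prompt(cos_demo, "D/E. How do I get E?"))
    print(build_prompt(cot_demo, "Block D is on top of Block E. How do I get Block E?"))
```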

There are usually natural language documents that describe the procedures for a company's employees. Plan extraction methods offer the opportunity to extract structured plans from these natural language descriptions of workflows [93], [95].

Paper | Task(s) | GLLMs Explored | Prompt Settings | Language(s)
[303] | Neural Architecture Search | GPT-4 | ZS | English
[304] | Multiple AI tasks in language, speech and vision areas | GPT-3.5, ChatGPT, GPT-4 | FS | English
[305] | Machine Learning Tasks | GPT-3.5 | FS | English
[306] | Machine Learning Tasks | GPT-4 | FS | English

TABLE 11. Summary of research works exploring GLLMs to automate machine learning tasks. Here ZS represents zero-shot, and FS represents few-shot.

Paper | Task(s) | GLLMs Explored | Prompt Settings | Language(s) | SOTA Results
[308] | Plan Extraction | GPT-3 | FS | English | Yes
[309] | Planning in Human-Robot Interaction | GPT-3.5 | ZS | English | No
[310] | Plan Extraction | GPT-3.5 | FS | English | No
[311] | Planning | InstructGPT, ChatGPT | FS | English | No

TABLE 12. Summary of research works exploring GLLMs for planning. Here ZS represents zero-shot, and FS represents few-shot.

These extracted plans can then be used by automated systems. Olmo et al. [308] explored the GPT-3 model for plan extraction in few-shot settings from natural language descriptions of workflows and showed that the GPT-3 model outperforms existing SOTA models in some cases. Xie et al. [310] explored GPT-3.5 models to extract plans from natural language descriptions. The authors reported that the models are poor planners on their own, which is in line with existing works [312]–[314], and that they are better at extracting plans from natural language. However, these models are sensitive to prompts and also struggle on tasks involving spatial or numerical reasoning.
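
A minimal sketch of few-shot plan extraction in the spirit of [308] and [310] is given below: a demonstration pairs a workflow description with a numbered plan, and the model is asked to produce a plan for a new description. The demonstration, output format and placeholder model call are assumptions.

```python
# Sketch: few-shot extraction of a structured plan from a workflow description.

def call_gllm(prompt: str) -> str:
    # Placeholder for a GPT-3/GPT-3.5 API call; canned output keeps the sketch offline.
    return "1. Receive invoice\n2. Verify purchase order\n3. Approve payment\n4. Archive invoice"

DEMO = (
    "Description: New employees first fill the onboarding form, then HR verifies the "
    "documents, and finally IT creates the accounts.\n"
    "Plan:\n1. Fill onboarding form\n2. Verify documents\n3. Create accounts\n"
)

def extract_plan(description: str) -> list[str]:
    prompt = f"{DEMO}\nDescription: {description}\nPlan:\n"
    steps = call_gllm(prompt).splitlines()
    # Strip the numbering to obtain a list of plan steps.
    return [s.split(". ", 1)[1] for s in steps if ". " in s]

if __name__ == "__main__":
    print(extract_plan("Invoices are received, checked against the purchase order, "
                       "approved for payment and then archived."))
```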

5         PERFORMANCE OF GLLMS IN SPECIFIC DOMAINS

Apart from the general domain, natural language processing is also explored in specific domains like healthcare, finance, legal, social media, etc. Analyzing domain-specific texts is more challenging because of domain-specific terminology and abbreviations, complex language structures, etc. In domains like healthcare, finance and legal, domain experts use many words and abbreviations that are specific to the domain and not commonly found in general domain texts. In domains like social media, the texts are mostly authored by the general public using informal language and slang words. Moreover, social media texts are noisy, with many misspelt words, emojis, irregular grammar and abbreviations [315], [316]. Inspired by the success of pretrained language models like BERT, RoBERTa, ELECTRA, DeBERTa and T5 in the general domain, these models are also explored for domain-specific NLP tasks [1]. However, the performance of general domain models is limited as these models are pretrained on general domain texts [81], [89], and fine-tuning alone cannot provide enough domain knowledge [1]. So, the research community focused on developing domain-specific pretrained language models either by continual pretraining or pretraining from scratch [1], [3]. Currently, domain-specific pretrained language models achieve state-of-the-art results in most tasks in specific domains like healthcare, finance, legal, social media, etc.

GPT-3 family large language models achieve impressive performances in most NLP tasks in zero and few-shot settings in the general domain. Surprisingly, these models outperform fine-tuned pretrained language models in some tasks and achieve state-of-the-art results [144], [155], [156], [158]. Inspired by the massive success of GLLMs in the general domain, the research community explored GLLMs in specific domains to assess how good these models are in domain-specific NLP tasks. Moreover, an extensive evaluation of these models in domain-specific tasks helps to arrive at valuable insights that will guide the research community to improve the performance further and increase the usage of these models in domain-specific NLP tasks.

5.1       Healthcare Domain

The recent works explored GLLMs for a variety of clinical NLP tasks like question answering [117], [195], [317], [320], [322], [323], [326], [333], [335], [337], [338], [341], [342], text de-identification [318], dialogue summarization [319], [328], [330], named entity recognition [149], [321], relation extraction [321], text classification [138], [321], [326], [335], semantic similarity [321], [326], text simplification [324], [327], [343], relation classification [149], [326], text summarization [325], [331], natural language inference [137], [138], [326], [335], word sense disambiguation [329], biomedical evidence extraction [329], coreference resolution [329], medical status extraction [329], medical attribute extraction [329], synonym generation [334], clinical decision support [336], [340] and diagnostic lists generation [339].

Paper | GLLMs Explored | Task(s) | Prompt Settings | Language(s) | Outperforms Domain-Specific Models
[317] | ChatGPT, GPT-4 | Question Answering | ZS | English | -
[318] | ChatGPT, GPT-4 | Text De-identification | ZS | English | Yes
[319] | GPT-4 | Dialogue Summarization | FS | English | Yes
[320] | GPT-3.5, ChatGPT, GPT-4 | Question Answering | ZS, FS | English | Yes
[321] | GPT-3.5, GPT-4 | Named Entity Recognition, Relation Extraction, Document Classification and Semantic Similarity | ZS, FS | English | Yes
[322] | GPT-3.5, ChatGPT | Question Answering | ZS | Japanese | -
[323] | GPT-3.5, GPT-4 | Question Answering, Reasoning | ZS | Chinese | Yes
[324] | GPT-3 | Text Simplification | FS | English | -
[149] | GPT-3 | Entity Extraction, Relation Classification | FS | English | No
[137] | ChatGPT, GPT-4 | Natural Language Inference | ZS, FS | English | -
[325] | ChatGPT | Text Summarization | FS | English | Yes
[138] | GPT-3.5, GPT-4 | Natural Language Inference, Document Classification | ZS, FS | English | -
[195] | GPT-3, ChatGPT, GPT-4 | Question Answering | FS | Japanese | -
[326] | GPT-3 | Natural Language Inference, Relation Classification, Semantic Similarity, Question Answering, Text Classification | FS | English | No
[327] | ChatGPT | Text Simplification | ZS | English | -
[328] | GPT-3, GPT-4 | Dialogue Summarization | FS | English | -
[329] | GPT-3 | Clinical Sense Disambiguation, Biomedical Evidence Extraction, Coreference Resolution, Medication Status Extraction, Medication Attribute Extraction | ZS, FS | English | -
[330] | GPT-3 | Dialogue Summarization | ZS, FS | English | -
[331] | GPT-3 | Text Summarization | ZS, FS | English | -
[332] | ChatGPT | Multi-Turn Medical Dialogue | ZS | Chinese | No
[117] | GPT-4 | Question Answering | FS | English | No
[333] | ChatGPT | Question Answering | ZS | Chinese | -
[334] | GPT-3 | Synonym Generation | ZS | English | -
[335] | GPT-3 | Natural Language Inference, Question Answering, Text Classification | ZS | English | No
[336] | ChatGPT | Clinical Decision Support | ZS | English | -
[337] | ChatGPT | Question Answering | ZS | English | -
[338] | ChatGPT | Question Answering | ZS | English | -
[339] | ChatGPT | Diagnosis Lists Generation | ZS | English | -
[340] | ChatGPT | Clinical Decision Support | ZS | English | -
[341] | GPT-3, GPT-3.5, ChatGPT | Question Answering | ZS | English | -
[342] | ChatGPT | Question Answering | ZS | English | -
[343] | ChatGPT, GPT-4 | Text Simplification | ZS | English | -

TABLE 13. Summary of research works exploring GLLMs for various NLP tasks in the healthcare domain. Here ZS represents zero-shot, and FS represents few-shot. Here ’-’ represents there is no comparison between GLLMs and domain-specific pretrained language models in the paper.

Most of the research focused on English datasets, except a few works that focused on other languages like Japanese [195], [322] and Chinese [323], [332], [333]. Table 13 presents a summary of research works exploring GLLMs for various NLP tasks in the healthcare domain.

Lyu et al. [343] investigated the performance of ChatGPT and GPT-4 models in the healthcare domain, specifically in radiology, by evaluating their ability to simplify the content of radiology reports. Experiment results showed that (i) GPT-4 performs better than ChatGPT, and (ii) an optimized prompt with detailed instructions improves the performance of both models by a good margin. Antaki et al. [342] evaluated the effectiveness of ChatGPT in answering Ophthalmology questions.

The test set consists of both easy and moderate-level questions. Experiment results showed that ChatGPT achieves an average accuracy of 49.25%. Specifically, ChatGPT is able to answer general medicine questions with good accuracy. However, its performance in specific sub-areas of Ophthalmology is much worse. Gilson et al. [341] evaluated GLLMs like GPT-3, GPT-3.5 and ChatGPT in answering the medical questions in the Step 1 and Step 2 exams of the USMLE. Experiment results showed that ChatGPT outperforms the other two models by a good margin. Rao et al. [336] demonstrated that ChatGPT performs better in the final diagnosis than in the initial diagnosis. This is because ChatGPT has access to more clinical data during the final diagnosis than during the initial one.

Carpenter et al. [334] demonstrated that GPT-3 can be used for synonym generation for drugs of abuse. The authors query GPT-3 repeatedly for each drug to generate multiple synonyms, which are later filtered. The generated synonyms are then used to build a lexicon that is helpful for pharmacovigilance on social media platforms. Inspired by the success of the GPT-3 model for text summarization in the general domain, Shaib et al. [331] explored the GPT-3 model for summarizing biomedical documents. Experiment results revealed that (i) GPT-3's performance is promising in the case of single-document summarization and (ii) GPT-3 struggles to summarize content from multiple biomedical documents. Nair et al. [330] proposed a novel approach called “MEDSUM-ENT”, a multi-stage framework for clinical dialogue summarization. The proposed method leverages the GPT-3 model through multiple intermediate calls to extract medical entities from the conversations. In the final summarization step, the extracted entities, task instructions and in-context examples help the GPT-3 model to generate high-quality summaries. Based on an evaluation of radiology reports simplified by ChatGPT, Jeblick et al. [327] reported that ChatGPT-generated simplified radiology reports are generally factually correct, complete and not potentially harmful. However, further analysis reveals that some simplified reports contain factually incorrect sentences, potentially harmful paragraphs and a lack of essential medical findings.
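
A rough sketch of a MEDSUM-ENT-style pipeline is shown below: intermediate GLLM calls extract medical entities from the dialogue, and a final call produces a summary conditioned on those entities. The prompts, entity types and placeholder call are illustrative assumptions, not the exact prompts from [330].

```python
# Sketch: entity-guided clinical dialogue summarization with multiple GLLM calls.

def call_gllm(prompt: str) -> str:
    # Placeholder for a GPT-3 family API call; canned replies keep the sketch offline.
    if prompt.startswith("List the medications"):
        return "ibuprofen"
    if prompt.startswith("List the symptoms"):
        return "headache, nausea"
    return "Patient reports headache and nausea; currently taking ibuprofen."

def extract_entities(dialogue: str) -> dict:
    # Intermediate calls extract entities that will anchor the final summary.
    return {
        "symptoms": call_gllm(f"List the symptoms mentioned:\n{dialogue}"),
        "medications": call_gllm(f"List the medications mentioned:\n{dialogue}"),
    }

def summarize(dialogue: str) -> str:
    entities = extract_entities(dialogue)
    prompt = ("Summarize the clinical dialogue below. Be sure to mention these entities.\n"
              f"Symptoms: {entities['symptoms']}\nMedications: {entities['medications']}\n"
              f"Dialogue:\n{dialogue}\nSummary:")
    return call_gllm(prompt)

if __name__ == "__main__":
    print(summarize("Doctor: What brings you in? Patient: A headache and some nausea. "
                    "I took ibuprofen this morning."))
```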

Hirosawa et al. [339] investigated the effectiveness of ChatGPT for clinical diagnosis by evaluating its ability to generate accurate diagnosis lists for clinical vignettes with common chief complaints. Experimental results showed that ChatGPT can generate diagnosis lists with good accuracy. However, the accuracy rate of ChatGPT is still lower than that of physicians. Wang et al. [333] evaluated the performance of the ChatGPT model in answering medical questions in the Chinese language. Here, ChatGPT is prompted with questions in both English and Chinese to avoid language barriers. Experimental results show that the performance of ChatGPT is much lower than the average performance of medical students. For example, ChatGPT correctly answers 45.8% of the questions, while the average answering rate of medical students was 67.9% in 2021.

Some of the research works demonstrated that domain-specific pretrained language models outperform GLLMs. Hernandez et al. [335] compared the performance of the GPT-3 model with the performances of general and domain-specific pretrained language models on three healthcare NLP tasks: natural language inference, question answering and text classification. Experiment results showed that domain-specific pretrained language models achieve better results even though they are much smaller than GPT-3. Xu et al. [332] introduced MedGPTEval, a benchmark to assess large language models in the healthcare domain. An extensive evaluation showed that a domain-specific Chinese LLM outperforms general-purpose models like ChatGPT and ERNIE Bot. Singhal et al. [117] introduced Med-PaLM 2, a healthcare domain-specific LLM obtained by domain-specific fine-tuning of the PaLM 2 [68] model. Experiment results showed that Med-PaLM 2 outperforms few-shot GPT-4 and achieves new state-of-the-art results on the MultiMedQA benchmark. Moradi et al. [326] investigated the performances of BioBERT and GPT-3 in few-shot settings on five biomedical NLP tasks: text classification, natural language inference, question answering, relation extraction and semantic similarity. The authors observed that both the BioBERT and GPT-3 models underperform models fine-tuned using the full training data. Moreover, the BioBERT model outperforms GPT-3 in few-shot settings even though the BioBERT model is 514 times smaller than GPT-3.

Some research works showed that GLLMs can outperform domain-specific pretrained language models. Ma et al. [325] proposed ImpressionGPT, a novel approach for summarizing radiology reports using ChatGPT. The proposed method involves dynamic prompt construction and iterative optimization to further enhance the performance of ChatGPT. Evaluation on two standard datasets showed that the proposed framework achieves new SOTA results, outperforming fine-tuned models like ChestXrayBERT [348]. Liu et al. [323] introduced CMExam, a dataset with more than 60k multiple-choice medical questions in the Chinese language, and evaluated GLLMs like GPT-3.5 and GPT-4 on answer prediction and answer reasoning tasks. The authors observed that GPT-4 achieves the best results for both tasks, outperforming GPT-3.5 and medical domain-specific Chinese LLMs like Huatuo [352] and DoctorGLM [349]. Chen et al. [321] explored GLLMs like GPT-3.5 and GPT-4 on eight datasets spanning four tasks in zero and few-shot settings. The authors observed that fine-tuned PubMedBERT outperforms both GLLMs in all the biomedical tasks except question answering. In the case of biomedical question answering, GPT-4 outperforms the fine-tuned PubMedBERT model by a large margin of 17 points.

Paper | GLLMs Explored | Task(s) | Prompt Settings | Language(s) | Outperforms Domain-Specific Models
[344] | GPT-3 | Natural Language Inference | ZS, FS | English | -
[188] | GPT-3.5 | Question Answering | ZS | English | -
[345] | GPT-3 | Question Answering, Text Generation | ZS | English | -
[346] | ChatGPT | Text Classification | ZS, FS | English | No
[347] | ChatGPT | Question Answering, Text Generation | ZS | English | -

TABLE 14. Summary of research works exploring GLLMs for various NLP tasks in the legal domain. Here ZS represents zero-shot, and FS represents few-shot. Here ’-’ represents there is no comparison between GLLMs and domain-specific pretrained language models in the paper.

Giorgi et al. [319] explored models like the Longformer Encoder-Decoder (LED) [96] based on supervised fine-tuning and GLLMs like GPT-4 based on few-shot ICL for clinical dialogue summarization as part of the MEDIQA-Chat 2023 [350] shared task. Here, the authors used Instructor [351] to select the most similar examples for few-shot ICL. Experiment results based on automatic metrics like BERTScore and ROUGE demonstrated that GPT-4 not only outperforms the LED model but also achieves first rank in the shared task. For medical text de-identification, Liu et al. [318] proposed a novel approach called “DeID-GPT”, a two-step approach based on GLLMs. In the first step, the HIPAA identifiers are included in the prompt. In the second step, the GLLM receives the prompt and the medical record, based on which it generates a de-identified medical record with the personal information masked. The authors observed that GPT-4 outperforms not only ChatGPT but also fine-tuned models based on BERT, RoBERTa and ClinicalBERT.
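
The two-step DeID-GPT-style prompt can be sketched as follows; the identifier categories shown are a short illustrative subset of the HIPAA list, and call_gllm is a placeholder for a ChatGPT/GPT-4 API call.

```python
# Sketch: prompt-based de-identification where the identifier categories are
# listed in the prompt and the model masks them in the record.

IDENTIFIER_CATEGORIES = ["patient name", "dates", "phone numbers", "medical record numbers"]

def call_gllm(prompt: str) -> str:
    # Placeholder; a real call would return the masked record produced by the model.
    return "[NAME] was admitted on [DATE] with chest pain. Contact: [PHONE]."

def deidentify(record: str) -> str:
    prompt = ("Mask the following categories of protected health information with "
              f"bracketed placeholders: {', '.join(IDENTIFIER_CATEGORIES)}.\n\n"
              f"Record:\n{record}\n\nDe-identified record:")
    return call_gllm(prompt)

if __name__ == "__main__":
    print(deidentify("John Smith was admitted on 03/14/2022 with chest pain. "
                     "Contact: 555-0134."))
```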

5.2       Legal Domain

The recent works explored GLLMs for a variety of legal NLP tasks like natural language inference [344], question answering [188], [347], [352], text generation [345], [347] and text classification [346]. Table 14 presents a summary of research works exploring GLLMs for various NLP tasks in the legal domain. Bommarito et al. [188] evaluated the performance of the GPT-3.5 model in the legal domain by evaluating its ability to answer bar exam questions. The model answers the questions correctly at a rate of 50%, which is 25% more than the random-guess baseline. However, the model's performance is almost 18% below human performance, and the overall performance is below the passing threshold. Nguyen et al. [345] presented LawGPT 1.0, the first-ever chatbot model based on GPT-3 for the legal domain. The GPT-3 model is pretrained on a mostly generic corpus, so it lacks domain-specific knowledge. To add domain-specific knowledge, LawGPT is developed by fine-tuning the GPT-3 model on a law corpus. Experimental results showed that LawGPT 1.0 performs on par with existing legal assistants.

Chalkidis et al. [346] investigated how effective ChatGPT is for legal text classification by evaluating the model's performance on the LexGLUE [360] benchmark, which consists of seven legal text classification datasets.

The evaluation is performed in both zero and few-shot settings. Experiment results showed that ChatGPT performs poorly on legal text classification datasets. Choi et al. [347] demonstrated that the performance of ChatGPT on law exams is just above the passing threshold, i.e., equivalent to a C+ grade student. The authors found that advanced prompts like CoT [361] and ranking prompts performed worse than or the same as simple prompts for multiple-choice questions. For essay writing, the authors used carefully crafted simple prompts that include specific instructions at the end of the prompt.

5.3       Finance Domain

The recent works explored GLLMs for a variety of finance NLP tasks like text classification [136], [359], sentiment analysis [136], [352], [354], [356], named entity recognition [136], [356], question answering [136], [357], pairwise ranking [355], claim detection [356] and relation extraction [358]. Table 15 presents a summary of research works exploring GLLMs for various NLP tasks in the finance domain.

Li et al. [136] compared the performances of general LLMs like ChatGPT and GPT-4 in the finance domain with domain-specific models like BloombergGPT [115] and small fine-tuned models like FinBERT [82] and FinQANet [362]. The evaluation is done on five different datasets related to four financial NLP tasks: news headlines classification, sentiment analysis, entity extraction and question answering. The ChatGPT and GPT-4 models do well in the question answering task but lag behind in tasks requiring domain-specific knowledge, like entity extraction and sentiment analysis. Fatouros et al. [353] evaluated the effectiveness of ChatGPT for financial sentiment analysis by assessing its performance on a forex-related news headlines dataset. Experiment results showed that ChatGPT outperforms the domain-specific FinBERT [83] model by a large margin of 35% and also exhibits a high correlation with market returns. Leippold et al. [354] explored GPT-3 for financial sentiment analysis and for generating adversarial attacks. Experiment results showed that FinBERT outperforms keyword-based approaches and the few-shot GPT-3 model in financial sentiment analysis.

Paper | GLLMs Explored | Task(s) | Prompt Settings | Language(s) | Outperforms Domain-Specific Models
[136] | ChatGPT, GPT-4 | News Headlines Classification, Financial Sentiment Analysis, Named Entity Recognition, Question Answering | ZS | English | Yes
[353] | ChatGPT | Sentiment Analysis | ZS | English | Yes
[354] | GPT-3 | Sentiment Analysis | ZS | English | No
[355] | GPT-3.5 | Pairwise Ranking | FS | Chinese | -
[356] | ChatGPT | Sentiment Analysis, Claim Detection, Named Entity Recognition | ZS | English | No
[357] | ChatGPT, GPT-4 | Question Answering | ZS, FS | Chinese | -
[358] | ChatGPT, GPT-4 | Relation Extraction | FS | English | -
[352] | ChatGPT | Sentiment Analysis | ZS | Chinese | -
[359] | GPT-3.5, GPT-4 | Text Classification | ZS, FS | English | -

TABLE 15. Summary of research works exploring GLLMs for various NLP tasks in the finance domain. Here ZS represents zero-shot, and FS represents few-shot. Here ’-’ represents there is no comparison between GLLMs and domain-specific pretrained language models in the paper.

To study the robustness of FinBERT-based and keyword-based approaches, the authors explored GPT-3 to generate adversarial attacks. The main advantage of GPT-3 over existing adversarial attack generation methods is that the model makes more subtle changes to the instances, such that they are not noticeable to humans but can still fool the models. Wiriyathammabhum et al. [355] explored instruction fine-tuned T5 and GPT-3.5 models to evaluate investment-related social media posts in Chinese. The task involves two subtasks, namely pairwise ranking and unsupervised ranking. Experiment results showed that the few-shot prompted GPT-3.5 model outperforms both the instruction fine-tuned T5 model and the few-shot prompted GPT-3.5 model that uses English-translated social media posts.

Shah et al. [356] compared the performance of ChatGPT with that of fine-tuned pretrained language models for three different financial NLP tasks: claim detection, sentiment analysis and named entity recognition. The authors observed that fine-tuned models outperform ChatGPT, but ChatGPT performs much better than some open-source LLMs. Zhang et al. [357] introduced FinEval, a new benchmark to evaluate the financial domain knowledge of LLMs in the Chinese language. FinEval includes 4,661 multiple-choice questions in the Chinese language from four different categories spanning 34 academic subjects. Experiment results showed that GPT-4 achieves around 70% accuracy and outperforms all other LLMs, including ChatGPT and Chinese LLMs.

Rajpoot et al. [358] assessed the effectiveness of ChatGPT and GPT-4 for financial relation extraction in few-shot settings. As the choice of examples is crucial in few-shot ICL, the authors explored learning-free and learning-based retrievers for example selection. The authors observed that GPT-4 outperforms ChatGPT by a decent margin, and the learning-based retriever performs better than the learning-free retriever.

6         MULTILINGUAL PERFORMANCE OF GLLMS

Overview. GLLMs are pretrained over large volumes of text data from multiple languages. For example, the corpus used to pretrain the GPT-3 model includes text from around 90 languages, and the percentage of English text is more than 90% [4], [366]. In the beginning, most of the research focused on assessing the performance of GLLMs on English datasets only. However, it is essential to evaluate these models on datasets from non-English languages, especially low-resource languages, to know how effective GLLMs are for non-English languages, and the insights gained from a comprehensive evaluation help to further improve these models for non-English languages.

Research works exploring GLLMs in multilingual settings. Recently, some of the research works focused on evaluating GLLMs across various non-English languages. The evaluation is done on various tasks like parts of speech tagging [363], [366], named entity recognition [363], [370], relation extraction [363], natural language inference [363], [366], [370], question answering [363], [365]–[367], [370], text summarization [363], [365], [366], [370], commonsense reasoning [363], [366], grammar error correction [364], text generation [365], [369], paraphrase identification [366], sentiment analysis [132], [366], [370], language identification [132], machine translation [132], [370], genre identification [131], hate speech detection [368] and toxicity detection [370]. Most of the research focused on general domain datasets, except a few focused on other domains like social media [368], [370] and news [370]. Table 16 presents a summary of research works exploring GLLMs for NLP tasks in multilingual settings.

Bang et al. [132] presented an extensive multilingual evaluation of ChatGPT across three tasks: sentiment analysis, language identification and machine translation.

Paper | GLLMs Explored | Task(s) | Prompt Settings | Language(s) | Domain(s)
[363] | ChatGPT | PoS Tagging, Entity Extraction, Relation Extraction, Natural Language Inference, Question Answering, Text Summarization, Common Sense Reasoning | ZS | 37 languages | General
[364] | ChatGPT | Grammar Error Correction | ZS, FS | English, German, Chinese | General
[365] | GPT-3 | Question Answering, Natural Language Generation, Text Summarization | ZS | German, Spanish, Russian, Turkish, Catalan | General
[366] | GPT-3.5, ChatGPT, GPT-4 | Natural Language Inference, Paraphrase Identification, Commonsense Reasoning, Question Answering, Parts of Speech Tagging, Sentiment Analysis, Text Summarization | ZS | 70 languages | General
[132] | ChatGPT | Sentiment Analysis, Language Identification, Machine Translation | ZS | Multiple languages including low-resource languages like Sundanese, Javanese, etc. | General
[131] | ChatGPT | Genre Identification | ZS | English, Slovenian | General
[367] | ChatGPT | Question Answering, Reasoning | ZS | Six languages including Chinese, German and French | General
[368] | ChatGPT | Hate Speech Detection | ZS | Eleven languages including Hindi, Arabic and Italian | Social Media
[369] | GPT-4 | Three Text Generation Tasks | ZS | Ten languages including Chinese and Japanese | General
[370] | ChatGPT, GPT-4 | Question Answering, Sentiment Analysis, Text Summarization, Named Entity Recognition, Toxicity Detection, Machine Translation, Natural Language Inference, Causal Reasoning | ZS, FS | Indonesian, Vietnamese, Thai, Tamil | General, Social Media, News

TABLE 16. Summary of research works exploring GLLMs for NLP tasks in multilingual settings. Here, ZS represents zero-shot, and FS represents few-shot.

When compared to English, the performance of ChatGPT degrades in the case of low-resource languages, particularly in the case of languages with non-Latin scripts. Das et al. [368] assessed the effectiveness of ChatGPT for emoji-based hate speech detection in multilingual settings. The authors reported that ChatGPT exhibits good performance but tends to misclassify abusive content as hate speech for non-English languages in the case of non-protected groups. Moreover, Armengol et al. [365] reported that the performance of GPT-3 can be improved in the case of low-resource languages with optimized tokenization.

The focus of existing benchmarks like HELM [371] and BIG-Bench [372] is on the English language. So, some of the research works focused on introducing new benchmarks to facilitate a systematic and comprehensive evaluation of the multilingual performance of GLLMs [366], [370]. For example, Ahuja et al. [366] presented MEGA, a comprehensive evaluation benchmark comprising 16 datasets covering 70 languages. Based on the evaluation of GLLMs like GPT-3.5, ChatGPT and GPT-4, the authors reported that GLLMs perform well in the case of languages with Latin scripts, and the performance is worst in the case of low-resource languages with non-Latin scripts across tasks. One of the possible reasons for this is the quality of tokenization.

Similarly, Leong et al. [370] introduced BHASA, a benchmark to evaluate the performance of LLMs in four Southeast Asian languages. The benchmark consists of 20 datasets covering eight NLP tasks. The authors reported that (i) GPT-4 achieves better results compared to ChatGPT, and (ii) overall, the performance on some of the tasks is promising, with a lot of room for improvement in other tasks.

Some of the existing works demonstrated that using prompts in English improves the performance of GLLMs in the case of non-English languages [131], [363]. For example, Lai et al. [363] performed a comprehensive evaluation of the multilingual abilities of ChatGPT on seven tasks covering more than 30 languages ranging from high-resource to extremely low-resource languages. The experiment results confirmed the bias of ChatGPT towards the English language, i.e., the performance is better for English compared to other languages and prompts in the English language can enhance the performance for non-English languages. The possible reason for the bias of GLLMs towards the English language is that GLLMs are trained mostly on English text corpus; hence, these models can better understand the prompt if it is in English [131].

Some of the research works investigated how GLLMs exhibit multilingual capabilities [367] and how effective GLLM-based evaluators are in scaling up evaluation in multilingual settings [369]. Zhang et al. [367] proposed a novel back-translation prompting approach to systematically study how ChatGPT exhibits multilingual capabilities, although these models are largely pretrained on English text corpora. The authors demonstrated that ChatGPT relies on translation when operating in multilingual settings. Moreover, the multilingual performance of GLLMs is good only in the case of tasks which can be translated.

Paper | GLLMs Explored | Task(s) | Prompt Settings | Domain(s) | Language(s) | Outperforms Human Annotators
[373] | ChatGPT | Stance, Relevance, Frame and Topics Detection | ZS | Social Media, News | English | Yes
[374] | GPT-3.5 | Three Binary Text Classification Tasks | ZS, FS | General | English | Yes
[375] | GPT-4 | Political Tweets Classification | ZS | Social Media | English | Yes
[376] | ChatGPT | Stance Detection, Sentiment Analysis, Hate Speech Detection, Bot Detection | ZS | Social Media | English | No
[377] | ChatGPT | Detection of Hateful, Toxic and Offensive Comments | ZS | Social Media | English | No
[378] | GPT-3.5, GPT-4 | Adverse Drug Reaction Extraction | ZS, FS | Healthcare | English | -
[379] | GPT-3 | Text Entailment, Topic Classification, Sentiment Analysis, Answer Type Classification, Question Generation, Text Generation | ZS | General | English | -
[380] | GPT-3 | Sentiment Analysis, Relation Extraction, Named Entity Recognition | FS | General | English | -
[381] | GPT-3.5 | Named Entity Recognition | ZS | Healthcare | English, French, Spanish, Italian, Basque | -
[382] | GPT-3.5 | Text Summarization | ZS, FS | General | English | -
[383] | ChatGPT | Detection of Stance, Topics, Relevance, General Frame and Policy Frame | ZS, FS | Social Media, News | English | Yes
[324] | GPT-3 | Radiology Text Simplification | FS | Healthcare | English | -

TABLE 17. Summary of research works exploring GLLMs for data labelling. Here, ’-’ represents that the paper doesn’t include a comparison between GLLMs and human annotators.

Hada et al. [369] assessed the effectiveness of GPT-4 as an evaluator for natural language generation tasks in multilingual settings. The authors reported that GPT-4 tends to favour high scores and should be used carefully.

7         DATA LABELLING AND DATA AUGMENTATION ABILITIES OF GLLMs

7.1       Data Labelling

Overview. Large language models, specifically GLLMs, have achieved impressive performances in most of the NLP tasks, highlighting the huge potential of these models. However, large model size, high latency, high inference costs, proprietary access (in the case of GLLMs) and confidentiality concerns (in the case of sensitive domains like medical [381]) have become bottlenecks for the practical use of these models. Because of these bottlenecks, in environments with constrained resources or confidentiality constraints, pretrained language models are preferred over GLLMs as these models are much smaller in size and also more efficient compared to GLLMs [384]. For example, BERT base and large models contain just 110M and 340M parameters, while the GPT-3 model contains 175B parameters. Moreover, it is reported that GLLMs are trailing the SOTA models, with 4% to 70% lower performance when evaluated across a set of 25 diverse natural language processing tasks [133].

The performance of fine-tuned pretrained language models is largely determined by the quality as well as the quantity of labelled data. Human-annotated data is considered the gold standard [385], [386], and there are two strategies for obtaining it [373], [375]. The first one is using trained expert coders like students and research assistants, and the second one is using crowd workers from online platforms like Amazon Mechanical Turk. Although human-labelled data is considered the gold standard, the human annotation process is expensive, laborious and time-consuming. The second strategy, i.e., using crowd workers, is comparatively less expensive, but there is a growing concern regarding the degrading annotation quality of crowd workers [387]. Moreover, the annotation quality varies across annotators, and hence it is inconsistent. To address the challenges associated with the human annotation process, there is a growing interest in the NLP research community to leverage the extraordinary generative abilities of GLLMs to make the data annotation process less expensive, faster and more consistent. Similar to the human annotation process, GLLMs are provided with detailed instructions along with some labelled examples to label the data.
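
The labelling workflow described above, i.e., detailed instructions plus a few labelled examples followed by the unlabelled instance, can be sketched as below; the sentiment task, label set and placeholder call are illustrative assumptions.

```python
# Sketch: few-shot GLLM-based data labelling for a text classification task.

LABELS = ["positive", "negative", "neutral"]

def call_gllm(prompt: str) -> str:
    # Placeholder for a ChatGPT/GPT-4 API call.
    return "negative"

def label_instance(text: str, examples: list[tuple[str, str]]) -> str:
    lines = [f"Classify the sentiment of the text as one of: {', '.join(LABELS)}.", ""]
    for ex_text, ex_label in examples:  # labelled demonstrations
        lines += [f"Text: {ex_text}", f"Label: {ex_label}", ""]
    lines += [f"Text: {text}", "Label:"]
    answer = call_gllm("\n".join(lines)).strip().lower()
    return answer if answer in LABELS else "neutral"  # simple fallback for malformed output

if __name__ == "__main__":
    demos = [("The battery lasts all day.", "positive"),
             ("Average screen, nothing special.", "neutral")]
    print(label_instance("The hinge broke within a week.", demos))
```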

Research exploring GLLMs for data labelling. The research community explored GLLMs for data labelling in a variety of NLP tasks like stance detection [373], [376], political tweets classification [375], sentiment analysis [376], [379], [380], hate speech detection [376], [377], bot detection [376], toxic comments detection [377], offensive comments detection [377], adverse drug reaction extraction [378], text entailment [379], topic classification [379], text generation [379], answer type classification [379], question generation [379], relation extraction [380], named entity recognition [380], [381], text summarization [382], radiology text simplification [324] etc. Most of the research works focused on English datasets, except a few research works focused on other languages like French [381], Spanish [381], Italian [381] and Basque [381]. Table 17 presents a summary of research works exploring GLLMs for data labelling.

Gu et al. [378] labelled sentences from PubMed abstracts using the GPT-3.5 model and then fine-tuned the PubMedBERT model for adverse drug reaction extraction. Experiment results showed that (i) PubMedBERT achieves results comparable to the SOTA model and (ii) PubMedBERT outperforms the GPT-3.5 and GPT-4 models by large margins of 6 and 5 points in F1 score, respectively. Based on the evaluation of multiple NLU and NLG tasks, Wang et al. [379] demonstrated that GPT-3 labelled data can result in a 50 to 96% reduction in labelling expenses. Moreover, pretrained language models fine-tuned on GPT-3 labelled data outperform the few-shot GPT-3 model in both NLU and NLG tasks. Further, the authors proposed an approach based on active learning to make use of both human and GPT-3 labels, which further enhances the performance of the fine-tuned models. Meoni et al. [381] investigated the effectiveness of GPT-3.5 labelled data and dictionary-based labelled data in fine-tuning pretrained language models to extract clinical entities in multiple languages like English, Spanish, Basque, Italian and French. The authors reported that (i) the performance of GPT-3.5 labelled data is on par with dictionary-based labelled data, and (ii) combining annotations from both approaches further enhances the results. Xu et al. [382] proposed InheritSumm, a novel approach for training small text summarization models like ZCode++ [388] using GPT-3.5 generated summaries. The authors showed that the ZCode++ model, with just 390M parameters, trained using GPT-3.5 generated summaries performs on par with GPT-3.5 in zero and few-shot settings.

Zhu et al. [376] investigated how effective ChatGPT is at labelling data for social computing tasks. Based on the evaluation of five datasets spanning tasks like stance detection, hate speech detection, bot detection and sentiment analysis, the authors reported that ChatGPT achieves an average accuracy of 60.9%. Li et al. [377] investigated the ability of ChatGPT to label hateful, offensive and toxic comments and compared the performance with MTurk annotations. The authors observed that ChatGPT's performance is promising, as it is able to label 80% of the comments correctly. Moreover, the performance of ChatGPT is more consistent for non-harmful comments than for harmful comments.

Some of the research works [373]–[375], [383] showed that GLLMs as data annotators can outperform human annotators. Gilardi et al. [373] investigated the effectiveness of ChatGPT as an annotator in zero-shot settings for four text classification tasks involving tweets and news articles. The authors reported that ChatGPT is more effective than MTurk crowd workers as (i) ChatGPT achieves 25 points more than crowd workers in terms of accuracy, (ii) ChatGPT is approximately 30 times cheaper, and (iii) the intercoder agreement of ChatGPT is higher than that of crowd workers. He et al. [374] proposed a novel approach called “explain then annotate” to enhance the performance of GLLMs as text data annotators. The proposed approach involves two steps: (i) the GLLM generates explanations for the demonstrations and then (ii) annotates the data by leveraging annotation guidelines, demonstrations and explanations through CoT prompting. Evaluation on three binary text classification tasks revealed that GPT-3.5 outperforms crowd workers on one task and matches the performance of crowd workers on the other two tasks.

Tornberg et al. [375] demonstrated that zero-shot GPT-4 outperforms human annotators in labelling political English tweets. Further analysis demonstrated that GPT-4 possesses the ability to accurately label tweets that involve logical reasoning from contextual information.

Alizadeh et al. [383] compared the performances of GLLMs like ChatGPT, open-source LLMs like FLAN [389] and MTurk annotators in labelling data (tweets and news articles) for five text classification tasks. The authors reported that ChatGPT achieves the best results, outperforming both open-source LLMs and MTurk annotators. One promising observation here is that open-source LLMs outperform MTurk annotators, and their performance is comparable to ChatGPT.

7.2       Data Augmentation

Overview. The performance of downstream task-specific models is determined by the quality as well as the quantity of labelled data. Fine-tuning the pretrained language models on a small amount of labelled data will result in overfitting [1] and, subsequently, poor performances. However, it is not feasible all the time to label a large number of instances as the annotation process is expensive. So, the research community focused on alternative approaches like data augmentation to increase the size of training sets in a relatively inexpensive way [398]–[402]. The data augmentation approaches focus on generating additional training instances either by making small changes to the existing instances or creating new instances with a distribution similar to the existing instances.

Data augmentation was initially explored in the area of computer vision [398] and then explored in natural language processing [399]–[402]. When compared to computer vision, text data augmentation is more challenging because of the discrete nature of text. Data augmentation can be done at the character, word and sentence levels. Character-level data augmentation approaches involve random deletion, addition, exchange or insertion of characters [403], [404]. For example, in the case of keyboard augmentation, a random character is replaced with its neighbour based on the QWERTY layout [403].

Paper | GLLMs Explored | Task(s) | Prompt Settings | Domain(s) | Language(s)
[390] | ChatGPT | Intent Classification | ZS | General | English
[391] | ChatGPT | Machine Translation | ZS | General | Korean, German
[392] | GPT-3 | Named Entity Recognition | ZS | News, Social Media, General, Healthcare | English
[393] | ChatGPT, GPT-4 | Question Answering | ZS | Healthcare | English
[394] | GPT-3 | Text Classification | FS | General | English
[395] | ChatGPT | Medical Event Classification, Medication Identification | ZS | Healthcare | English
[143] | GPT-3 | Intent Classification | ZS | Social Media | English
[396] | ChatGPT | Text Classification | ZS | General, Healthcare | English
[397] | ChatGPT | Open Intent Detection | ZS | General | English

TABLE 18. Summary of research works exploring GLLMs for paraphrasing-based data augmentation.

Similar to character-level data augmentation, word-level data augmentation approaches involve deletion, replacement, exchange or insertion of words at random positions [405], [406]. Sentence-level approaches like back translation and paraphrasing generate augmented instances by rewriting the sentence [407], [408]. Overall, the main drawbacks of existing data augmentation approaches are (i) a lack of sufficient diversity in the augmented instances and (ii) difficulty in guaranteeing the accurate labelling of the augmented data [396]. To address these drawbacks, the research community focused on leveraging the exceptional generative abilities of GLLMs for data augmentation to ensure sufficient diversity and correct labelling in the augmented data.

7.2.1       Paraphrasing

Research works exploring GLLMs for paraphrasing- based data augmentation. The research community explored GLLMs for paraphrasing in various NLP tasks like intent classification [143], [390], [397], machine translation [391], named entity recognition [392], question answering [393], medical event classification [395], medication identification [395] etc. GLLM-based paraphrasing is explored in multiple domains like general [390]–[392], [394], [396], [397], news [392], social media [143], [392] and healthcare [392], [393], [395], [396]. Table 18 presents a summary of research works exploring GLLMs for paraphrasing-based data augmentation.

Cegin et al. [390] compared the quality of paraphrases generated by ChatGPT and crowd workers for intent classification. The authors reported that (i) ChatGPT generates more diversified paraphrases compared to crowd workers and (ii) the robustness of models fine-tuned on ChatGPT-generated paraphrases is comparable to that of models fine-tuned on crowd-worker generated paraphrases. Oh et al. [391] explored ChatGPT-based data augmentation to generate additional training instances to fine-tune the mBART-50 model [213] for machine translation involving the Korean-German language pair. Here, the authors explored three different prompting strategies, out of which the storytelling prompting approach achieves the best results and improves the BLEU score by 0.68. The storytelling prompting approach involves generating a three-sentence story based on the source sentence and then translating each of these sentences into the target language. Abaskohi et al. [394] proposed a novel approach based on prompt-based tuning and contrastive learning to fine-tune pretrained language models for text classification. As contrastive learning requires data augmentation, the authors explored models like GPT-3 and OPT-175B [39] for paraphrasing. Experiment results showed that GPT-3 based paraphrasing outperforms existing data augmentation approaches like back translation [427] and easy data augmentation [405].

To overcome the problem of limited training instances for EHR analysis, Sarker et al. [395] explored ChatGPT to generate additional training instances through paraphrasing. Experiments on medical event classification and medication identification tasks revealed that fine-tuning pretrained language models on the ChatGPT-augmented training set enhances the performance. Dai et al. [396] proposed AugGPT, a ChatGPT-based approach to generate additional training instances by paraphrasing existing training instances for few-shot classification. Experiments on general and medical domain text classification datasets revealed that AugGPT outperforms all the existing data augmentation approaches by a good margin. Further analysis showed that AugGPT generates more diversified instances while preserving the original labels.
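
A minimal sketch of AugGPT-style paraphrasing-based augmentation is given below, where each training instance is rewritten into several label-preserving variants; the prompt and the placeholder ChatGPT call are illustrative assumptions.

```python
# Sketch: paraphrasing-based data augmentation with a GLLM, where augmented
# instances inherit the label of the original instance.

def call_gllm(prompt: str) -> str:
    # Placeholder for a ChatGPT API call; returns newline-separated paraphrases.
    return ("I can't sign in to my account.\n"
            "My account login keeps failing.\n"
            "Logging into my account does not work.")

def augment(text: str, label: str, n: int = 3) -> list[tuple[str, str]]:
    prompt = (f"Rewrite the following sentence in {n} different ways while preserving "
              f"its meaning. Return one rewrite per line.\nSentence: {text}")
    paraphrases = [p.strip() for p in call_gllm(prompt).splitlines() if p.strip()]
    return [(p, label) for p in paraphrases[:n]]  # augmented instances keep the label

if __name__ == "__main__":
    print(augment("I am unable to log in to my account.", "login_issue"))
```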

Paraphrasing-based data augmentation for entity extraction is challenging because of the difficulty in preserving span-level labels. Sharma et al. [392] explored GPT-3 models, back translation and a PEGASUS-based paraphraser for synthetic data generation using paraphrasing. The authors observed that the larger GPT-3 variant with inline annotations achieves the best results for entity extraction across datasets from multiple domains.

Paper | GLLMs Explored | Task(s) | Prompt Settings | Domain(s) | Language(s)
[409] | ChatGPT | Text Classification | ZS | Social Media | Chinese
[410] | ChatGPT | Note2Dialogue Generation | ZS | Healthcare | English
[411] | GPT-3.5 | Training Phi-1 LLM | ZS | Programming | English
[412] | ChatGPT, GPT-4 | Cross-lingual Common Sense Reasoning | FS | General | Multiple Languages
[413] | GPT-3 | Hate Speech Detection | FS | Social Media | English
[414] | GPT-3 | Undesired Content Detection | ZS, FS | Social Media | English
[415] | ChatGPT, GPT-4 | Question Answering | ZS | Healthcare | English
[143] | GPT-3 | Intent Classification | ZS | General | English
[416] | GPT-3.5, GPT-4 | Training Smaller LLMs | ZS | General | English
[155] | GPT-3.5 | Relation Extraction | FS | General, Scientific Literature | English
[417] | GPT-4 | CoT Instruction Tuning | FS | General | English
[418] | GPT-4 | Instruction Tuning | ZS | General | English, Chinese
[419] | GPT-3 | Call Segmentation, Topic Extraction | ZS | Dialogue | English
[420] | GPT-3 | Paraphrase Detection | ZS | General, Scientific Literature | English
[421] | ChatGPT | Tweet Intimacy Prediction | FS | Social Media | Multiple Languages
[422] | ChatGPT | Named Entity Recognition, Relation Classification | ZS | Healthcare | English
[423] | ChatGPT | Topic Classification | ZS | News, Social Media | English
[424] | ChatGPT | Neural Machine Translation | ZS | General | Multiple Languages
[425] | GPT-3, Codex | Table Question Answering | ZS | General | English
[426] | GPT-4 | Text Generation Evaluation | ZS | General | Multiple Languages

TABLE 19. Summary of research works exploring GLLMs for data generation-based data augmentation. Here ZS represents zero-shot and FS represents few-shot.

7.2.2       Data Generation

Research works exploring GLLMs for data generation-based data augmentation. The research community explored GLLMs for data generation-based data augmentation in various NLP tasks like dialogue generation [410], training smaller LLMs [411], [416], common sense reasoning [412], hate speech detection [413], undesired content detection [414], question answering [415], [425], intent classification [143], relation extraction [155], [422], instruction tuning [417], [418], paraphrase detection [420], tweet intimacy prediction [421], named entity recognition [422], machine translation [424] etc. GLLM-based data generation for data augmentation is explored in multiple domains like general [143], [155], [412], [416]–[418], [420], [424]–[426], social media [409], [413], [414], [421], [423], news [423], scientific literature [155], [420], healthcare [410], [415], [422], dialogue [419], programming [411] etc. Table 19 presents a summary of research works exploring GLLMs for data generation-based data augmentation.

Some of the research works explored GLLMs for data generation-based data augmentation in various text classification tasks [143], [409], [413], [414], [421], [423]. For example, Hartvigsen et al. [413] used GPT-3 with demonstration-based prompting to create a large-scale synthetic dataset for the detection of implicit hate speech. Here, the authors explored a variant of constrained beam search to ensure subtle toxicity in the generated examples. Michail et al. [421] investigated the effectiveness of ChatGPT-generated synthetic data to fine-tune multilingual models for tweet intimacy prediction in the case of languages with no labelled instances. Here, ChatGPT is prompted with instructions and examples from a high-resource language and asked to generate new examples in the target language. Most of the existing research works use simple prompts for data generation, limiting the diversity of the generated synthetic data. To address this, Yu et al. [423] proposed a novel approach that leverages attributed prompts for data generation to increase the diversity of the generated data. Based on the evaluation on four topic classification datasets, the authors observed that the proposed approach (i) enhances the model performance and (ii) reduces the querying cost of ChatGPT by a large margin.
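
A sketch of attributed data generation in the spirit of [423] follows: instead of one generic prompt per class, attributes such as length and style are varied to diversify the synthetic examples. The attribute set and the placeholder call are illustrative assumptions.

```python
# Sketch: class-conditioned synthetic data generation with attributed prompts.
import itertools

ATTRIBUTES = {
    "length": ["short", "long"],
    "style": ["formal news headline", "casual social media post"],
}

def call_gllm(prompt: str) -> str:
    # Placeholder for a ChatGPT API call.
    return "Central bank raises interest rates amid inflation concerns."

def generate_for_class(topic: str, per_combination: int = 1) -> list[str]:
    texts = []
    # Vary the attribute combination for each request to diversify the outputs.
    for length, style in itertools.product(ATTRIBUTES["length"], ATTRIBUTES["style"]):
        for _ in range(per_combination):
            prompt = (f"Write a {length} {style} about the topic '{topic}'. "
                      "Return only the text.")
            texts.append(call_gllm(prompt))
    return texts

if __name__ == "__main__":
    print(generate_for_class("economy"))
```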

Some of the research works explored GLLMs for data generation-based data augmentation in various information extraction tasks like relation extraction [155], relation classification [422] and named entity recognition [422]. Xu et al. [155] evaluated how effective the GPT-3.5 model is at relation classification. To address the data scarcity problem in few-shot settings, the authors used the GPT-3.5 model to generate additional data. The prompt used for data generation consists of instance descriptions along with some example instances. Tang et al. [422] used ChatGPT in zero-shot settings to generate synthetic data for tasks like named entity recognition and relation classification in the healthcare domain. The authors showed that the model fine-tuned on this synthetic data outperforms zero-shot ChatGPT by a large margin in both tasks.

Some of the research works explored GLLMs for data generation in LLM development stages, like LLM pretraining [411], [416] and instruction tuning [417], [418]. Gunasekar et al. [411] trained Phi-1, a code LLM, using GPT-3.5 generated synthetic textbook and code data. Here, the training corpus includes 1B tokens of GPT-3.5 generated Python textbook and code data along with 6B tokens of code data from the web. Eldan et al. [416] explored GLLMs like GPT-3.5 and GPT-4 to generate TinyStories, a synthetic dataset of stories with only the words understood by typical 3 to 4-year-old kids. The authors demonstrated that the GLLM generated dataset can be used to train smaller LLMs, which can generate coherent and consistent stories with near-perfect grammar. Instruction tuning requires large human-annotated datasets, which are often difficult to obtain. Stanford Alpaca 4 and Vicuna 5 showed the effectiveness of synthetic instruction tuning datasets generated using GPT-3.5 and ChatGPT, respectively. Inspired by the success of these models, Peng et al. [418] explored advanced models like GPT-4 to generate instruction-tuning datasets in English and Chinese languages. The experiment results showed that GPT-4 generated instruction tuning datasets further enhance the zero-shot performance of LLaMA models. Liu et al. [417] used GPT-4 to generate LogiCoT, a synthetic dataset of CoT rationales. This dataset can be used for instruction tuning the LLMs to enhance their logical reasoning abilities.
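
As an illustration of this style of instruction-data generation, the sketch below expands a small set of seed instructions into new (instruction, response) pairs using a GLLM. The seed list, prompt wording, JSON output format and model name are assumptions for illustration; the actual pipelines of Alpaca, Vicuna and [418] involve much larger seed pools and additional filtering.

```python
# Sketch of seed-based instruction-data generation (Alpaca-style), not the
# exact pipeline of any cited work. Seeds, prompt and model are assumptions.
import json
from openai import OpenAI

client = OpenAI()

SEED_INSTRUCTIONS = [
    "Explain the difference between a list and a tuple in Python.",
    "Summarize the plot of a short story in two sentences.",
]

def generate_pairs(num_new: int = 3):
    pairs = []
    for _ in range(num_new):
        gen_prompt = (
            "Here are example instructions:\n"
            + "\n".join(f"- {s}" for s in SEED_INSTRUCTIONS)
            + "\nWrite one new, different instruction and answer it. "
              'Respond as JSON: {"instruction": ..., "response": ...}'
        )
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": gen_prompt}],
            temperature=1.0,
        )
        try:
            pairs.append(json.loads(resp.choices[0].message.content))
        except json.JSONDecodeError:
            continue  # skip generations that are not valid JSON
    return pairs

print(generate_pairs())
```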

8         DETECTING GLLM GENERATED TEXT

Overview. GLLMs demonstrated extraordinary human-like capabilities to understand user queries, follow the instructions and then answer the user queries with high-quality content. Apart from responding to user queries, these models can also generate news articles, research papers, code and essays with human-like fluency. With the ability to generate text with human-like fluency, these models are widely adopted in a variety of real-world applications like writing assistants, coding assistants, chatbots, etc. [428]. Although there is a lot of excitement about GLLMs and their applications in recent times, there are also growing concerns regarding the potential misuse of these models for illegal activities [429], such as fake news on social media platforms [430], [431], fake reviews on e-commerce websites [432], fake research papers [433], academic fraud [434], etc. For example, these models can be easily used by malicious users to create fake news [430], [431] and propagate it on social platforms at a large scale to exaggerate or manipulate facts to get an undue advantage, especially during political campaigns. Similarly, students can use these models to write their assignments or generate code for their projects [434], and GLLM generated fake research papers [433] can have a serious impact on the scientific community as these papers are written without conducting any experiments.

There is a strong need for the development of approaches to detect GLLM generated text, as there are growing concerns regarding the misuse of GLLMs. Such approaches help to distinguish the GLLM generated text from human-generated text and verify the source as well as the authenticity of the information. However, detecting GLLM generated text is more challenging as models like ChatGPT and GPT-4 can generate content with human-like fluency.

Research exploring the detection of GLLM generated text. To avoid misuse and ensure the safe use of these models, the research community focused on developing approaches to identify GLLM generated text accurately. The recent research works explored the detection of GLLM generated text in multiple domains like scientific literature [435]–[438], academic [439], [440], healthcare [438], [441], [442], news [443], legal [429], [442], social media [432], [438], finance [429], etc. Most of the research works focused on the English language, while a few research works focused on other languages like Japanese [436], German [438] and Spanish [440]. Table 20 presents a summary of research works exploring the detection of GLLM generated text.

Some of the research works focused on assessing the effectiveness of the existing machine-generated text detection tools to detect GLLM generated text. A number of online tools are available, ranging from simple classifiers based on logistic regression to advanced classifiers based on pretrained language models, to detect ChatGPT-generated text. To assess the effectiveness of these tools, Pegoraro et al. [444] introduced a dataset having ChatGPT-generated responses for questions from various domains like finance, medicine, etc., and user-generated responses from social media platforms. The comprehensive evaluation showed that the maximum success rate of these tools is below 50%, which leaves a lot of room for improvement. Orenstrakh et al. [440] evaluated the effectiveness of eight popular detectors using three metrics, namely resilience, false positives and accuracy. The authors observed that CopyLeaks, GPTKit and GLTR achieve the best results for the metrics accuracy, false positives and resilience. However, all these detectors struggle with non-English languages and paraphrased LLM-generated text. There is a lack of a comprehensive evaluation benchmark for machine-generated text detection, as the existing approaches use different models, datasets and settings. To address this, He et al. [447] proposed MGTBench, the first machine-generated text detection benchmark.

Paper | Detection Target | Approach | Satisfactory Performance | Training Free | Domain(s) | Language(s)
[444] | ChatGPT generated text | Evaluate multiple online tools | No | N/A | Multiple domains | English
[435] | GPT-3 generated text | Classifiers based on machine learning models like LR, SVM and deep learning models like LSTM and BERT | Yes | No | Scientific Literature | English
[436] | ChatGPT and GPT-4 generated text | Classifier based on random forest and stylometric features | Yes | No | Scientific Literature | Japanese
[439] | GPT-3 and ChatGPT generated text | Classifier based on models like SVM and RoBERTa | Yes | No | Academic | English
[437] | ChatGPT generated text | Classifier based on models like RoBERTa | No | No | Scientific Literature | English
[441] | ChatGPT generated text | Classifier based on models like BERT | Yes | No | Healthcare | English
[440] | ChatGPT generated text | Evaluate multiple online tools | Yes | N/A | Academic | English, Spanish
[443] | GPT-3 generated text | Evaluate human evaluators | No | N/A | Stories, News, Recipes | English
[442] | ChatGPT and GPT-4 generated text | Classifier based on models like BERT and RoBERTa | Yes | No | Law, Medical, Dialogue, General | English
[438] | GPT-3.5, ChatGPT and GPT-4 generated text | Training-free divergent N-gram analysis | Yes | Yes | Healthcare, Social Media, Scientific Literature | English, German
[445] | ChatGPT generated text | Evaluate the robustness of existing detectors | No | N/A | General | English
[446] | ChatGPT generated text | Evaluate existing plagiarism tools | No | N/A | General | English
[447] | ChatGPT generated text | Propose benchmark and evaluate existing detectors | Yes | N/A | General | English
[432] | ChatGPT generated text | Propose novel approach based on DistilBERT and SHAP to detect and explain | Yes | No | Social Media | English
[429] | ChatGPT generated text | Introduce new dataset and evaluate multiple existing detection models | Yes | N/A | General, Finance, Healthcare, Legal, Psychology | English
[448] | GPT-3 and ChatGPT-based bots | Propose FLAIR to detect online GPT-3 and ChatGPT-based bots | Yes | Yes | General | English
[449] | ChatGPT generated text | Classifiers based on models like RoBERTa and T5 | Yes | No | General | English
[428] | ChatGPT generated text | Propose a zero-shot approach based on local optimality | Yes | Yes | General | English
[450] | ChatGPT generated text | Propose an approach based on Siamese Network and binary classifier | Yes | No | General | English
[451] | ChatGPT polished text | Trains classifier and polish ratio models to detect and explain | Yes | No | General | English
[452] | GPT-3.5 generated text | Evaluate robustness using paraphrase attacks | No | N/A | General | English

TABLE 20. Summary of research works exploring the detection of GLLM generated text.

Evaluation on this benchmark showed that, except for the ChatGPT detector [429] and the LM detector [453], the performance of other detectors is not satisfactory. Guo et al. [429] introduced the HC3 dataset, having human-authored and ChatGPT-generated responses to questions from multiple domains like legal, healthcare, finance, psychology, etc. The performance of existing detection approaches on the HC3 dataset is just satisfactory, and linguistic analysis showed that human-authored answers are shorter but use a larger vocabulary compared to ChatGPT-generated answers.

Some of the research works focused on developing approaches based on trained classifier models to detect GLLM generated text. Theocharopoulos et al. [435] evaluated the effectiveness of classifiers based on models like logistic regression, support vector machine, LSTM, and BERT to identify GPT-3 generated scientific abstracts. The LSTM-based classifier with word2vec embeddings achieves an accuracy of more than 98% and outperforms the other classifiers. Zaitsu et al. [436] observed that LLM-generated texts differ significantly from human-written texts in terms of stylometric features. The authors demonstrated that a random forest trained with different stylometric features can identify LLM-generated Japanese text with 100% accuracy. Liu et al. [439] reported that a fine-tuned RoBERTa model achieves an accuracy of more than 90% on the AruGPT dataset of human-written and GLLM generated argumentative essays. Moreover, linguistic analysis revealed that GLLM generated texts tend to be more complex syntactically, while human-generated texts are lexically more complex. To facilitate the development of a ChatGPT-written abstract detector, Yu et al. [437] introduced CHEAT, a large dataset of ChatGPT and human-written abstracts. Based on the evaluation of multiple existing approaches like ZeroGPT, OpenAI detector, ChatGPT-detector-roberta [429] and ChatGPT-qa-detector-roberta [429], the authors reported that the performance is far from satisfactory and that human involvement further increases the detection difficulty. Zhan et al. [442] treated the detection of LLM generated text as a binary classification problem and proposed a novel approach based on a fine-tuned RoBERTa model. The authors reported that the proposed approach exhibits good performance and also has the ability to detect text generated using a detection evasion technique. Mitrovic et al. [432] proposed a novel approach based on DistilBERT [92] and SHAP [454] to detect machine-generated text and explain the reasoning. The proposed approach achieves an accuracy of 79%, and based on the explanations, the authors observed that ChatGPT-generated text maintains a polite tone, lacks specific details and generally refrains from expressing emotions.
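
The following minimal sketch illustrates classifier-based detection: a RoBERTa sequence classifier assigns a probability that a passage is machine-generated. The checkpoint name is an assumption (a publicly shared ChatGPT detector is used purely for illustration); any binary classifier fine-tuned on human vs. GLLM text, as in the works above, would be used the same way.

```python
# Sketch of classifier-based machine-generated text detection.
# The checkpoint name is an assumed, illustrative choice.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CHECKPOINT = "Hello-SimpleAI/chatgpt-detector-roberta"  # assumed HF checkpoint
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT)
model.eval()

def detect(text: str) -> dict:
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1).squeeze()
    labels = [model.config.id2label[i] for i in range(probs.numel())]
    return dict(zip(labels, probs.tolist()))  # label -> probability

print(detect("The mitochondria is the powerhouse of the cell."))
```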

Chen et al. [449] introduced OpenGPTText, which includes ChatGPT-generated paraphrased text. The authors reported that fine-tuned classifiers based on models like RoBERTa and T5 can achieve impressive results in detecting ChatGPT-generated text, with an accuracy of more than 97%. Yu et al. [450] introduced GPT-Pat, a novel approach based on ChatGPT, a Siamese network and a binary classifier, to detect machine-generated text effectively. The proposed approach enhances the SOTA accuracy by more than 12% and also exhibits better robustness to attacks like re-translation and text polishing. Yang et al. [451] focused on detecting GLLM-polished text, which is more challenging and useful in real-world applications. The proposed approach involves training a classification model to identify the machine-generated text and a polish ratio (regression) model to explain the ChatGPT involvement. A polish ratio of 0.2 indicates ChatGPT involvement, and a value of more than 0.6 indicates that the text is entirely ChatGPT generated.

Training-based approaches to detect LLM-generated text have limited flexibility, especially when used for new domains [438]. To overcome this drawback, some of the research works focused on developing training-free approaches to detect GLLM generated text. Yang et al. [438] proposed DNA-GPT, a training-free approach based on divergent n-gram analysis. With the proposed approach, the authors achieved SOTA results on both English and German datasets. Wang et al. [448] proposed a novel framework called FLAIR to detect LLM-based bots with a single question in an effective way. The results showed that the proposed approach is effective and a good alternative to existing CAPTCHA-based approaches. Mireshghallah et al. [428] investigated whether models other than the generator can be used to identify machine-generated text. In general, smaller models serve as more effective universal text detectors. These models exhibit better accuracy in identifying text produced by both small and larger models. For example, OPT-125M achieves better results compared to the GPT-J 6B model in detecting ChatGPT-generated text.
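
The sketch below gives a simplified, illustrative version of the divergence idea behind training-free detection: truncate the candidate text, let a model regenerate the ending several times, and score how strongly the regenerated endings overlap with the original ending. It is inspired by, but much simpler than, the DNA-GPT procedure of [438]; the model name, truncation ratio and n-gram size are assumptions.

```python
# Simplified regenerate-and-compare detection sketch (inspired by DNA-GPT,
# not the authors' exact procedure). Model and hyperparameters are assumptions.
from openai import OpenAI

client = OpenAI()

def ngrams(text: str, n: int = 4) -> set:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlap_score(candidate: str, n_samples: int = 5, trunc_ratio: float = 0.5) -> float:
    toks = candidate.split()
    cut = int(len(toks) * trunc_ratio)
    prefix, original_tail = " ".join(toks[:cut]), " ".join(toks[cut:])
    tail_ngrams = ngrams(original_tail)
    scores = []
    for _ in range(n_samples):
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": f"Continue this text:\n{prefix}"}],
            temperature=0.7,
        )
        regen = ngrams(resp.choices[0].message.content)
        scores.append(len(regen & tail_ngrams) / max(1, len(tail_ngrams)))
    return sum(scores) / len(scores)  # higher overlap -> more likely machine-generated
```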

Some of the research works focused on assessing the robustness of machine-generated text detectors towards different attacks. Shi et al. [445] evaluated the robustness of existing detectors using attacks like synonym word replacement and writing style modification. The authors implemented both attacks using LLMs. The results showed that the existing detectors are not robust to the attacks, which emphasizes the need for more robust and reliable detectors to detect and avoid the misuse of LLMs. Krishna et al. [452] showed that existing detectors like OpenAI detector, GPTZero and DetectGPT [463] are not robust to paraphrase attacks. For example, paraphrase attacks result in a drop of more than 65% accuracy in the case of DetectGPT.

Some of the research works focused on assessing the effectiveness of humans in identifying GLLM generated text. For example, Clark et al. [443] observed that non-expert evaluators are unable to differentiate GPT-3 generated text from human-authored text in three different domains, namely news, recipes and stories. The reason for this is that the evaluators arrived at their decisions based on surface-level features without considering the advanced text generation capabilities of the GPT-3 model.

9         ROBUSTNESS OF GLLMS

Overview. GPT-3 family large language models achieve impressive performances in zero and few-shot settings in many NLP tasks. In some tasks like text classification [144], relation extraction [156], etc., GLLMs without any explicit fine-tuning outperform state-of-the-art fine-tuned models. For example, Sun et al. [144] demonstrated that InstructGPT, with an advanced prompting strategy, achieves SOTA results using just 16 examples on four text classification datasets.

Paper | GLLMs Explored | Task(s) | Prompt Settings | Robustness | Domain(s) | Language(s)
[455] | GPT-3, GPT-3.5 | Nine NLU Tasks | ZS, FS | Adversarial Input | General | English
[456] | GPT-3.5, ChatGPT | Four NLU Tasks, Machine Translation | ZS | Out-of-Distribution | General, Medical | English
[457] | Codex | Semantic Parsing | ZS, FS | Adversarial Input | Programming | English
[458] | ChatGPT | Eight Tasks including Four NLU Tasks | ZS, FS | Adversarial Prompt | General | English
[459] | Codex, InstructGPT, ChatGPT | Code Generation | ZS | Adversarial Prompt | Programming | English
[425] | GPT-3, Codex | Table Question Answering | FS | Adversarial Input | General | English
[460] | ChatGPT | Fourteen IE Tasks | ZS, FS | Adversarial Prompt | General | English
[461] | ChatGPT, GPT-4 | Question Answering | ZS, FS | Out-of-Distribution | General | English
[462] | ChatGPT | Text-to-SQL Generation | ZS | Adversarial Input | General | English

TABLE 21. Summary of research works exploring GLLMs robustness to out-of-distribution instances, adversarial prompts and adversarial inputs. Here ZS represents zero-shot, and FS represents few-shot.

Similarly, Wan et al. [156] achieved SOTA results in relation extraction with the GPT-RE framework. However, to increase the reliability of these models in real-world applications, especially in critical domains like medicine, it is essential to systematically study the robustness of these models in various scenarios. Adversarial robustness refers to the model's ability to maintain good performance even in the case of deliberately crafted instances [464], [465]. These instances are called adversarial instances and are carefully designed by making subtle changes in the original inputs to deceive the model. Out-of-distribution (OOD) instances refer to examples that differ significantly from the data distribution used to train the model [466]. These instances fall outside the range of the model's training data and present challenges to the model's performance and generalization ability. Some of the recent research works focused on evaluating the robustness of GLLMs to out-of-distribution instances [456], [461], adversarial prompts [458]–[460] and adversarial inputs [425], [455], [457], [462] in one or more natural language processing tasks. Table 21 presents a summary of research works assessing GLLM robustness to out-of-distribution instances, adversarial prompts and adversarial inputs.

Research works exploring GLLM robustness. Some of the research works evaluated the robustness of GLLMs in specific tasks like semantic parsing [457], code generation [459], table question answering [425], multi-choice question answering [461] and text-to-SQL generation [462]. Zhuo et al. [457] reported that Codex-based semantic parsers are not robust to adversarial examples, and the robustness can be enhanced using few-shot in-context learning. Shirafuji et al. [459] studied the robustness of GPT-3 family models like Codex, InstructGPT, and ChatGPT to adversarial prompts in the code generation task. The authors observed that InstructGPT and ChatGPT exhibit better robustness compared to Codex. However, there is much room for improvement, indicating that quality code generation requires well-designed prompts. Zhao et al. [425] proposed RobuT, a benchmark to systematically study the robustness of large language models to adversarial inputs in table question answering. The authors reported that GLLMs like GPT-3 and Codex exhibit better robustness than fine-tuned models. Moreover, the authors demonstrated that GLLM generated adversarial inputs can enhance the adversarial robustness of fine-tuned models. Liu et al. [461] reported that ChatGPT and GPT-4 perform well in multiple choice question answering but struggle to answer out-of-distribution questions. Liu et al. [462] showed that ChatGPT exhibits impressive zero-shot performance in text-to-SQL generation. Moreover, ChatGPT demonstrates better robustness to adversarial inputs than SOTA models in text-to-SQL generation.

Some of the research works evaluated the GLLM robustness in multiple natural language understanding and generation tasks [455], [456], [458], [460]. Chen et al. [455] assessed the robustness of GPT-3 and GPT-3.5 models on 21 datasets covering nine natural language understanding tasks. Here the authors used adversarial text transformations from TextFlint [467]. The authors observed that the models are robust in tasks like machine reading comprehension but exhibit a performance degradation of more than 35% in tasks like sentiment analysis and natural language inference. Wang et al. [456] evaluated the robustness of GPT-3.5 and ChatGPT models on adversarial and out-of-distribution (OOD) samples on nine datasets covering four NLU tasks and machine translation. The authors observed that ChatGPT exhibits good performance on adversarial and OOD samples, but still, there is much room for improvement.
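
A minimal robustness probe of this kind can be sketched as follows: task accuracy is measured on clean inputs and on inputs perturbed by simple character swaps, and the drop is reported. The perturbation here is a crude stand-in for the richer TextFlint-style transformations used in [455], and the classifier is a placeholder for any GLLM-based or fine-tuned model.

```python
# Sketch of measuring robustness as the accuracy drop under a simple perturbation.
import random

def perturb(text: str, n_swaps: int = 2) -> str:
    chars = list(text)
    for _ in range(n_swaps):
        i = random.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]  # swap adjacent characters
    return "".join(chars)

def robustness_drop(classify, dataset):
    """dataset: list of (text, gold_label); classify: text -> predicted label."""
    clean = sum(classify(x) == y for x, y in dataset) / len(dataset)
    attacked = sum(classify(perturb(x)) == y for x, y in dataset) / len(dataset)
    return clean - attacked  # larger drop -> less robust

# Toy example with a trivial keyword classifier, for illustration only.
toy = [("the movie was wonderful", "pos"), ("a dull and boring film", "neg")]
toy_clf = lambda t: "pos" if "wonderful" in t else "neg"
print(robustness_drop(toy_clf, toy))
```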

Zhu et al. [458] developed PromptBench, a benchmark with more than 4k adversarial prompts to evaluate the robustness of large language models to adversarial prompts. The benchmark covers 13 datasets spanning eight tasks, including four NLU tasks. The authors observed that GLLMs are not robust to adversarial prompts. Moreover, word-level attacks are the most effective, resulting in a performance drop of more than 30%. Based on the evaluation of ChatGPT on fourteen information extraction sub-tasks, Han et al. [460] showed that ChatGPT is vulnerable to adversarial prompts, i.e., the performance is greatly affected by including irrelevant context in the prompt.

10         GLLMS AS EVALUATORS

Overview. Natural language processing tasks can be broadly classified into natural language understanding (NLU) and natural language generation (NLG). NLU involves the interpretation of text, while NLG involves generating human-like text. The evaluation of NLU outputs is relatively straightforward, while the evaluation of NLG outputs is challenging because of the diversity and inherent complexity of the text [468]. Moreover, NLG evaluation involves assessing the generated text outputs in multiple dimensions, such as coherence, fluency, naturalness and semantic consistency. Human evaluation and automatic evaluation are two existing approaches for NLG evaluation. Human evaluation depends on competent annotators for an accurate and reliable assessment [469].

Human Evaluation vs Automatic Evaluation. Human evaluation is treated as the gold standard, but it is time-consuming, expensive, difficult to scale, inconsistent, and not reproducible [468], [481]. To address the issues with human evaluation, automatic evaluation metrics are developed, which fall broadly into two categories: n-gram-based and embedding-based. N-gram-based metrics assess the quality based on the lexical overlap between the generated and reference texts. Some of the commonly used n-gram-based metrics are BLEU [487], ROUGE [488] and METEOR [489]. However, these metrics have a poor correlation with human scores because of their inability to capture semantic meaning [490]. Later, with the evolution of transformers and pretrained language models, the researchers developed embedding-based metrics like BERTScore [491], MoverScore [492], BARTScore [493], CodeBERTScore [494], etc. These metrics leverage pretrained language models and assess the quality based on the semantic similarity between the generated and reference text. The main drawback of the existing automatic evaluation metrics is the requirement for references, which are difficult to obtain, especially in low-resource domains. Moreover, with just a few references, it is not possible to get an accurate and reliable assessment as few references cannot account for all the semantic variations [468]. So, there is a strong need for automatic evaluation metrics which are reference-free.
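
The contrast between the two metric families can be illustrated with a short snippet that scores the same candidate against a reference with BLEU and with BERTScore. The sacrebleu and bert-score packages are used here as commonly available implementations; their presence and versions are assumptions, and note that both metrics still require references.

```python
# Sketch contrasting an n-gram metric with an embedding-based metric.
# Assumes the sacrebleu and bert-score packages are installed.
import sacrebleu
from bert_score import score as bert_score

candidates = ["The cat sat quietly on the mat."]
references = ["A cat was sitting on the mat."]

# Lexical overlap: penalizes valid paraphrases that use different words.
bleu = sacrebleu.corpus_bleu(candidates, [references])
print("BLEU:", round(bleu.score, 2))

# Semantic similarity via contextual embeddings: more tolerant of paraphrasing.
P, R, F1 = bert_score(candidates, references, lang="en")
print("BERTScore F1:", round(F1.mean().item(), 3))
```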

GLLM-based Evaluation. Recently, with the huge success of GLLMs in most of the NLP tasks, the research community focused on developing automatic evaluation metrics based on these models. These models possess the ability of in-context learning, while instruction tuning enables these models to align themselves with human evaluation [2]. These two abilities enable these models to imitate the behaviour of human evaluators, who typically evaluate natural language generation task outputs by understanding instructions and the given examples. The GLLM-based evaluation metrics demonstrate a strong correlation with human scores even in the absence of reference outputs [472], [477]. Table 22 presents a summary of research works exploring GLLM-based evaluation for various natural language generation tasks.

Research works exploring GLLM-based evaluation. The NLP researchers proposed various GLLM-based evaluation frameworks to evaluate the outputs of various NLG tasks like code generation [470], text style transfer [471], text summarization [468], [472], [475]–[480], [482], [483], dialogue generation [468], [472], [477], machine translation [426], [473], [474], [477], [480], [485], story generation [468], [483], paraphrase generation [468], text-to-image synthesis [285], data-to-text generation [477], [483], image captioning [480], text generation [481], and open-ended question answering [484], [486]. Most of the research works proposed evaluation frameworks using direct prompting, while some of the research works introduced evaluation frameworks based on advanced prompting strategies like chain-of-thought [470], [472] and error analysis prompting [474]. Some of the proposed evaluation frameworks work with and without references [470], [473], [483], while some of them require references [426], [471], [474], [480], [485], and some don't require any references [468], [472], [475]–[479], [481], [482], [484], [486].
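
A minimal sketch of the simplest, direct-prompting variant of such reference-free evaluation is shown below: the GLLM is asked to rate a generated summary on one dimension with a 1 to 5 score. The prompt wording, evaluation dimension and model name are assumptions and do not reproduce any specific framework from Table 22.

```python
# Sketch of reference-free GLLM-based evaluation with direct prompting.
# Prompt, dimension and model are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

def rate(source: str, summary: str, dimension: str = "coherence") -> str:
    prompt = (
        f"Source document:\n{source}\n\nSummary:\n{summary}\n\n"
        f"Rate the summary's {dimension} on a scale of 1 (poor) to 5 (excellent). "
        "Reply with only the number."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

print(rate("The council approved the new park budget on Monday...",
           "The council approved a park budget."))
```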

Lai et al. [471] investigated how effective ChatGPT is at evaluating the text style transfer task along three dimensions: fluency, content and style. The model achieves good correlations with human judgements, and the best results are obtained by using separate prompts for each dimension. Kocmi et al. [473] proposed GEMBA, a GPT-based metric to assess translation output quality, with references being optional. The authors reported that only GPT-3.5 and higher models are useful for the assessment, and GPT-4 achieves the best results. Based on the evaluation of four natural language generation tasks, namely paraphrase generation, text summarization, story generation and dialogue response generation, Chen et al. [468] showed that explicit scoring with a greedy decoding strategy is the best way to assess NLG outputs using GLLMs like ChatGPT. Luo et al. [475] evaluated ChatGPT's ability as a factual inconsistency evaluator for the text summarization task. Experiment results showed that ChatGPT outperforms existing metrics on most of the datasets.

Shen et al. [476] explored how effective ChatGPT can be as a zero-shot evaluator for abstractive summarization systems using different evaluation methods like Likert scaling [495] and head-to-head comparisons [496]. Extensive analysis showed that Likert scaling implemented as a multiple-choice question gives the best and most stable results. Liu et al. [478] designed a novel approach which uses BRIO [497], a contrastive learning-based method, to train smaller models like BART for text summarization, using metrics like GPTScore [477] or GPTRank for evaluation.

Paper | GLLMs Explored | Task(s) | Prompt Settings | References Required | Domain(s) | Language(s)
[470] | ChatGPT | Code Generation | ZS | Optional | Programming | Five Programming Languages
[471] | ChatGPT | Text Style Transfer | ZS | Yes | General | English
[472] | ChatGPT, GPT-4 | Text Summarization, Dialogue Generation | ZS | No | General | English
[473] | GPT, GPT-3.5, ChatGPT, GPT-4 | Machine Translation | ZS | Optional | General | English, German, Chinese, Russian
[468] | GPT-3.5, ChatGPT | Text Summarization, Dialogue Generation, Story Generation, Paraphrase Generation | ZS | No | General | English
[474] | GPT-3.5, ChatGPT | Machine Translation | ZS, FS | Yes | General | English, Chinese, German
[475] | ChatGPT | Text Summarization | ZS | No | General | English
[476] | ChatGPT | Text Summarization | ZS | No | General | English
[285] | GPT-4 | Text-to-Image Synthesis | ZS | N/A | General | English
[426] | GPT-4 | Machine Translation | ZS | Yes | General | English, German, Russian
[477] | GPT-3, GPT-3.5 | Dialogue Generation, Machine Translation, Text Summarization, Data-to-Text Generation | ZS, FS | No | General | English, Chinese
[478] | GPT-3, ChatGPT | Text Summarization | ZS | No | General | English
[479] | ChatGPT | Text Summarization | ZS | No | General | English
[480] | GPT-3.5 | Machine Translation, Text Summarization, Image Captioning | ZS | Yes | General | English
[481] | ChatGPT, GPT-4 | Text Generation | ZS | No | General | English
[482] | GPT-3.5 | Text Summarization | ZS | No | General | English
[483] | ChatGPT | Text Summarization, Story Generation, Data-to-Text Generation | ZS | Optional | General | English
[484] | GPT-4 | Open-ended Question Answering | ZS | No | General | English
[485] | GPT-4 | Machine Translation | ZS | Yes | General | Multiple Languages
[486] | GPT-4 | Open-ended Question Answering | ZS | No | General | English

TABLE 22. Summary of research works exploring GLLM-based evaluation for natural language generation tasks. Here ZS represents zero-shot, and FS represents few-shot.

The contrastive learning training method helps the model to effectively utilize the supervision signal offered by the reference LLMs. The evaluation showed that the proposed approach helps the smaller model to outperform LLMs like GPT-3 and ChatGPT.

Gao et al. [479] evaluated ChatGPT for text summarization using various human evaluation methods and reported that (i) ChatGPT-based evaluation is both cost-effective and reproducible, unlike human evaluation, (ii) the performance of ChatGPT-based evaluation is highly dependent on the prompt design, and (iii) ChatGPT generated explanations correlate with its scores. Jain et al. [482] explored the effectiveness of the GPT-3.5 model as a multi-dimensional evaluator of text summarization. The authors reported that using in-context learning, GPT-3.5-based evaluation achieves SOTA performances on the factual consistency and relevance dimensions. Based on the evaluation of five datasets covering text summarization, story generation and data-to-text generation, Wang et al. [483] reported that ChatGPT as an evaluator (i) exhibits good correlations with human scores, especially in the case of the story generation task, and (ii) is prompt sensitive. Bai et al. [484] introduced a novel evaluation framework called Language-Model-as-an-Examiner to evaluate open-ended questions. In this framework, the GLLM acts as a knowledgeable examiner, generates questions using its own knowledge and then performs reference-free evaluation. Yang et al. [485] developed the BigTrans model (based on the LLaMA-13B model) with a multilingual translation capacity of more than 100 languages. GPT-4 based assessment showed that BigTrans performance is on par with ChatGPT and Google Translate. Zheng et al. [486] explored GPT-4 as a judge to evaluate open-ended question answering using two newly introduced benchmarks, MT-Bench and Chatbot Arena. The experiment results showed that the GPT-4 judge achieves more than 80% agreement with human preferences.

Unlike the above-discussed research works, which used direct prompting, some of the works explored advanced prompting to offer better guidance and context for the GLLM evaluator. Zhuo et al. [470] developed a code generation evaluation framework based on ChatGPT and demonstrated that the proposed framework outperforms CodeBERTScore [494] consistently across multiple programming languages. Moreover, the performance of the evaluation framework can be enhanced using references and zero-shot CoT prompting. Liu et al. [472] proposed G-EVAL, a novel framework based on GPT-4 for the assessment of natural language generation tasks. The proposed framework uses CoT prompting and a form-filling paradigm. Here, CoT prompting enhances the performance of G-EVAL by offering more guidance and context. The performance of ChatGPT-based evaluation in segment-level machine translation is poor. To overcome this, Lu et al. [474] proposed a novel prompting strategy called Error Analysis (EA) prompting, which combines error analysis [498] and CoT prompting. The authors showed that with EA prompting, ChatGPT can assess translations at the segment level much better.

Some of the research works explored GLLMs for the evaluation of multi-modal AI tasks [285], fine-tuning open-source LLM evaluators [426], and paraphrasing references to enhance existing metrics based on pretrained language models [480]. For example, Lu et al. [285] introduced LLMScore (based on GPT-4), a new metric which can effectively capture both image and object-level compositionality for text-to-image synthesis evaluation. Some of the research works explored these models to fine-tune open-source LLMs so that they can be used as evaluators, which makes the evaluation less expensive. For example, Xu et al. [426] introduced InstructScore, a novel and explainable metric based on a fine-tuned LLaMA model for text generation evaluation. Here the authors use GPT-4 generated synthetic data to fine-tune the LLaMA model. InstructScore can generate an error diagnostic report having error details along with an explanation. Natural language generation evaluation using few references results in poor correlation with human judgements. To overcome this drawback, Tang et al. [480] introduced Para-Ref, which leverages LLMs to increase the number of references by paraphrasing. The evaluation on three NLG tasks, namely text summarization, machine translation and image captioning, showed that the proposed approach enhances the correlation of sixteen automatic evaluation metrics with human judgements by a good margin.

Some of the research works focused on addressing the limitations of using GLLMs as evaluators. For example, Wang et al. [481] demonstrated positional bias in GLLM-based evaluation, i.e., the order of candidate responses can significantly influence the results. The authors demonstrated that the two proposed strategies, namely multiple evidence calibration and balanced position calibration, can reduce the bias and enhance the correlation with human judgements.
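
A minimal sketch of such balanced position calibration is shown below: each pair of candidate responses is judged in both orders, and only consistent verdicts are kept. The judge() helper is a placeholder for a GLLM call returning "A" or "B"; this is an illustration of the calibration idea, not the authors' full method.

```python
# Sketch of balanced position calibration for pairwise GLLM evaluation.
def calibrated_compare(judge, question: str, resp_a: str, resp_b: str) -> str:
    first = judge(question, resp_a, resp_b)            # A shown first
    second = judge(question, resp_b, resp_a)           # order swapped
    second = {"A": "B", "B": "A"}.get(second, second)  # map back to original labels
    if first == second:
        return first   # consistent verdict across both orders
    return "tie"       # disagreement across orders -> treat as a tie

# Illustrative judge that always prefers the first-shown response (maximally biased):
biased_judge = lambda q, a, b: "A"
print(calibrated_compare(biased_judge, "Explain recursion.", "resp 1", "resp 2"))
# -> "tie": the calibration cancels out pure position preference
```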

11         FUTURE RESEARCH DIRECTIONS

11.1       Enhance Robustness of GLLMs

GLLMs achieved promising results in zero and few-shot settings across various NLP tasks. In some of the tasks, like data labelling [373]–[375], [383], text classification [144], relation extraction [156], question answering [132], [179], keyphrase generation [217], etc., these models achieved even SOTA results. However, some of the recent research works exposed the brittleness of these models towards out-of-distribution inputs [456], [461], adversarial prompts [458]–[460] and adversarial inputs [425], [455], [457], [462]. For example, Liu et al. [461] reported that ChatGPT and GPT-4 perform well in multiple choice question answering but struggle to answer out-of-distribution questions. Similarly, Chen et al. [455] observed more than 35% performance degradation for GPT-3 and GPT-3.5 models in tasks like sentiment analysis and natural language inference with adversarial inputs. The brittleness towards out-of-distribution and adversarial inputs makes these models unreliable and limits their practical utility, especially in sensitive domains. So, it is necessary for the research community to focus more on this research direction to make GLLMs more robust and enhance their reliability and usage.

11.2       Red Teaming

Red teaming involves an assessment to expose undesirable model behaviours like generating harmful text [499]–[502]. GLLMs trained over large volumes of text data with a simple next-word prediction objective are surprisingly good at generating text with human-like fluency. However, the other side is that these models sometimes generate harmful text. For example, Risabh et al. [499] observed that GLLMs like ChatGPT and GPT-4 generate answers to more than 60% of harmful queries. One of the possible reasons for this undesirable behaviour of GLLMs is that data used for pretraining these models includes toxic, biased and noisy text to some extent [499]. This unwanted behaviour of generating harmful text raises concerns and limits the scalable deployment of these models for public use. We can expect more research in future to expose such undesirable behaviour in various scenarios and eventually enhance the safety alignment as well as the safe use of GLLMs.

11.3       State-Of-The-Art Results Across NLP Tasks

In the beginning, GLLMs like GPT-3 achieved impressive performances in zero and few-shot settings across NLP tasks. Advanced GLLMs like ChatGPT and GPT-4 further pushed the results but still lagged behind the SOTA results achieved by pretrained language models fine-tuned based on supervised learning. Later, with the evolution of advanced prompting strategies and novel approaches, GLLMs are able to achieve SOTA results in some of the NLP tasks. For example, InstructGPT with the CARP prompting strategy using just 16 examples achieves SOTA results on four text classification datasets [144]. Similarly, Wan et al. [156] achieved SOTA results in relation extraction with the novel GPT-RE framework. Yang et al. [179] proposed a novel approach which uses GPT-3 as an implicit knowledge source and achieves SOTA results in knowledge-based visual question answering. In future, we can expect more focus from the research community to achieve SOTA results using GLLMs in as many NLP tasks as possible, which will be treated as a further push towards artificial general intelligence. Moreover, this eliminates the painful process of labelling large amounts of data and then fine-tuning pretrained language models separately for each downstream task.

11.4       Robust Approaches to Detect GLLM Generated Text

The ability to generate text with human-like fluency resulted in the wide adoption of GLLMs in various real-world applications like writing assistants, coding assistants, and chatbots [428]. There is a growing concern regarding the misuse of these models for various illegal activities [429], like fake news on social media platforms [430], [431], fake reviews on e-commerce websites [432], fake research papers [433], academic fraud [434], etc. The performance of existing approaches like DetectGPT, ZeroGPT, OpenAI detector, ChatGPT-detector-roberta and ChatGPT-qa-detector-roberta is not satisfactory [437], [444]. Moreover, the existing approaches are not robust to various attacks like paraphrasing, synonym word replacement and writing style modification [445], [452]. So, there is a great need for better approaches which can reliably detect GLLM generated text and are also robust to various attacks, including paraphrasing. With reliable and robust detection approaches, the misuse of GLLMs for various illegal activities can be reduced to a great extent.

11.5       Reduce Inference Costs

GLLMs achieve impressive performances across NLP tasks, with SOTA results in some tasks. However, the downside of using GLLMs is the high inference costs [503], [504]. For example, a small business is required to spend more than $21,000 monthly to use GPT-4 for better customer support 6. Such high inference costs have become a burden to small and medium-sized companies. Recently, Chen et al. [503] proposed FrugalGPT, a novel framework involving multiple strategies like prompt adaptation and LLM approximation to reduce the inference costs of GLLMs. The inference costs of GLLMs increase with the prompt size, as the inference cost is computed based on the number of tokens processed. Prompt adaptation focuses on reducing the size of the prompt by using fewer but effective examples or by querying the GLLMs as a batch. LLM approximation uses a cache to avoid querying the GLLM for similar queries, which eventually reduces the overall inference costs. Similarly, Cheng et al. [504] proposed batch prompting, which involves GLLM inference in batches rather than processing one sample at a time. The authors demonstrated that the proposed prompting strategy reduces the Codex model inference cost across ten datasets with little or no degradation in performance. Future research in this direction will result in much better approaches which will further reduce GLLM inference costs and make GLLM usage more affordable for companies.
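
The two ideas can be illustrated with a short sketch: a response cache avoids paying for repeated queries, and batch prompting packs several inputs into a single request. The prompt format, model name and OpenAI client usage are assumptions for illustration and do not reproduce the exact FrugalGPT or batch prompting implementations.

```python
# Sketch of two inference-cost reduction ideas: caching and batch prompting.
import functools
from openai import OpenAI

client = OpenAI()

@functools.lru_cache(maxsize=10_000)  # identical queries hit the cache, not the API
def cached_answer(query: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": query}],
    )
    return resp.choices[0].message.content

def batch_classify(texts: list[str]) -> list[str]:
    # Several inputs are packed into one prompt instead of one request per input.
    numbered = "\n".join(f"{i + 1}. {t}" for i, t in enumerate(texts))
    prompt = (
        "Classify the sentiment of each review as positive or negative. "
        f"Answer one label per line, in order.\n{numbered}"
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip().splitlines()
```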

11.6       Enhance Performance in Domain-Specific NLP Tasks

Inspired by the success of GLLMs in general domain NLP tasks, the research community explored GLLMs for NLP tasks in specific domains like healthcare, legal, finance, etc. However, the performances of GLLMs in domain-specific NLP tasks are not as impressive as those achieved in general domain NLP tasks [136], [326], [335], [346], [347], [356]. For example, Moradi et al. [326] reported that the BioBERT model outperforms GPT-3 in few-shot settings even though the BioBERT model is 514 times smaller than GPT-3. Chalkidis et al. [346] evaluated ChatGPT on the LexGLUE benchmark and reported that ChatGPT performs poorly on legal text classification datasets. Analyzing domain-specific texts is more challenging because of domain-specific terminology and abbreviations, complex language structures, etc. In domains like healthcare, finance and legal, domain experts use many words and abbreviations that are specific to the domain and not commonly found in general domain texts. There is a lot of scope to improve the performance of GLLMs in domain-specific NLP tasks, which reduces the bottleneck for the widespread adoption of these models in specific domains.

11.7       Handle Limited Context Length

One of the major drawbacks of GLLMs is their limited context length [52], [505], [506]. The maximum context length of GLLMs lies in the range of 2049 tokens to 32,768 tokens7. This limited context length poses a challenge and becomes a bottleneck for GLLMs to handle long documents or maintain long conversations in which the number of tokens exceeds the maximum context length. Recently, Li [505] proposed selective context, a novel approach to effectively utilize the limited context length by filtering out the less useful content in the input text. The authors demonstrated the effectiveness of the proposed approach using the ChatGPT model for question-answering and text summarization tasks across datasets having lengthy input instances. Future research in this direction will help in the evolution of more efficient approaches which will effectively utilize the limited context length and eliminate the bottlenecks for the application of GLLMs in tasks that require processing long inputs.
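
A much-simplified sketch of this kind of context pruning is shown below: sentences are scored with a crude informativeness proxy and kept greedily until a token budget is reached. The original selective context method scores content by self-information from a language model; the inverse-word-frequency proxy and the budget value used here are assumptions made to keep the example self-contained.

```python
# Simplified context-pruning sketch (not the exact selective context method).
import re
from collections import Counter

def prune_context(text: str, budget_tokens: int = 200) -> str:
    sentences = re.split(r"(?<=[.!?])\s+", text)
    freq = Counter(text.lower().split())

    def score(sent: str) -> float:  # rarer words -> higher informativeness proxy
        toks = sent.lower().split()
        return sum(1.0 / freq[t] for t in toks if t in freq) / max(1, len(toks))

    kept, used = [], 0
    for sent in sorted(sentences, key=score, reverse=True):
        n = len(sent.split())
        if used + n <= budget_tokens:
            kept.append(sent)
            used += n
    # restore the original order of the retained sentences
    return " ".join(s for s in sentences if s in kept)
```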

11.8       Ensure Fair Evaluation of GLLMs

GLLMs achieved impressive performances across NLP tasks and have received much attention recently. However, one concern regarding the evaluation of GLLMs is data contamination, which refers to the presence of test data instances of downstream tasks in the training corpus of GLLMs [46], [507], [508]. The problem of data contamination is more relevant in the case of GLLMs because of their proprietary nature and the non-disclosure of training corpus details. Recent research works have reported the problem of data contamination in GLLMs like ChatGPT [508] and GPT-4 [507]. For example, Golchin et al. [507] demonstrated that GPT-4 is contaminated with instances from text classification, natural language inference and text summarization datasets like WNLI [509], AG News [510] and XSUM [511]. Recently, Golchin et al. [507] proposed a novel approach to detect data contamination for LLMs. Future research must focus on developing simple and effective approaches to identify data contamination and ensure fair evaluation, enhancing the reliability of the impressive performances of GLLMs.
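
A simplified contamination probe in this spirit can be sketched as follows: the model is shown the first half of a benchmark instance and asked to reproduce the continuation, and unusually high n-gram overlap with the true continuation is taken as a warning sign. This is an illustration of the general idea rather than the exact guided-instruction method of [507]; the model name, split point and overlap measure are assumptions.

```python
# Simplified data-contamination probe sketch; model and parameters are assumptions.
from openai import OpenAI

client = OpenAI()

def ngram_overlap(a: str, b: str, n: int = 5) -> float:
    grams = lambda s: {tuple(s.lower().split()[i:i + n])
                       for i in range(len(s.split()) - n + 1)}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / max(1, len(gb))

def contamination_score(instance: str) -> float:
    toks = instance.split()
    prefix, tail = " ".join(toks[:len(toks) // 2]), " ".join(toks[len(toks) // 2:])
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content":
                   f"Complete the following text exactly as it continues:\n{prefix}"}],
        temperature=0,
    )
    # high overlap with the true continuation hints at memorized training data
    return ngram_overlap(resp.choices[0].message.content, tail)
```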

11.9       Reduce Hallucinations

Despite the remarkable performances of GLLMs, there is a growing concern regarding their tendency to generate factually incorrect information [512], [513]. This tendency to generate text that doesn't align with existing world knowledge, deviates from the user's input or contradicts the context generated earlier is referred to as hallucination [512]. Hallucination is a serious problem yet to be addressed fully [514], and it reduces the reliability of GLLMs, which becomes a bottleneck for the adoption of GLLMs, especially in sensitive domains like healthcare [515]. Recently, some of the research works focused on evaluating hallucination in GLLMs [515], assessing the ability of GLLMs to identify hallucinations [516] and developing approaches to reduce hallucinations [517]. For example, Li et al. [516] proposed HaluEval, a novel benchmark to assess the ability of GLLMs to identify hallucinations. Peng et al. [517] introduced LLM-AUGMENTER, a novel approach that reduces hallucinations in ChatGPT without impacting the quality of generated responses. Considering the seriousness of the hallucination problem, we can expect more future research to identify and reduce hallucinations in GLLMs, which will enhance their reliability and adoption across domains, including sensitive domains like healthcare.

11.10       Enhance the Performance of GLLMs for Non-English Languages

The performance of GLLMs is not impressive in the case of non-English languages, especially languages with non-Latin scripts [131], [132], [363], [366]. This is because GLLMs are mostly pretrained on English text. For example, more than 90% of the text in the pretraining corpus of the GPT-3 model is from the English language [4], [366]. Some of the possible options to enhance the performance of GLLMs for non-English languages are the use of English prompts [131], [363] and optimized tokenization [365]. There is a great need for better approaches to greatly enhance the performance of GLLMs for non-English languages, which will increase their adoption across the globe and benefit users from non-English communities.

12         CONCLUSION

In this survey paper, we provide a comprehensive review of GPT-3 family large language models in multiple dimensions, covering more than 350 recent research papers. Here, we present foundation concepts and GPT-3 family large language models and discuss the performances of these models in various downstream tasks, specific domains and multiple languages. We also discuss the data labelling, data augmentation and data generation abilities of GLLMs, the robustness of GLLMs, the effectiveness of GLLMs as evaluators, and finally, conclude with multiple insightful future research directions. Overall, this comprehensive survey paper on GPT-3 family large language models will serve as a good resource for both academic and industry people to stay updated with the latest research.

Acknowledgments

The author would like to thank Ajit Rajasekharan for his encouragement and support.

REFERENCES

  • [1]         K. S. Kalyan, A. Rajasekharan, and S. Sangeetha, “Ammus: A survey of transformer-based pretrained models in natural language processing,” arXiv preprint arXiv:2108.05542, 2021.
  • [2]         L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., "Training language models to follow instructions with human feedback," Advances in Neural Information Processing Systems, vol. 35, pp. 27730–27744, 2022.
  • [3]         K. S. Kalyan, A. Rajasekharan, and S. Sangeetha, “Ammu: a survey of transformer-based biomedical pretrained language models,” Journal of biomedical informatics, vol. 126, p. 103982, 2022.
  • [4]         T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., "Language models are few-shot learners," Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020.
  • [5]         T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," arXiv preprint arXiv:1301.3781, 2013.
  • [6]         J. Pennington, R. Socher, and C. D. Manning, "Glove: Global vectors for word representation," in Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 2014, pp. 1532–1543.
  • [7]         P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “Enriching word vectors with subword information,” Transactions of the association for computational linguistics, vol. 5, pp. 135–146, 2017.
  • [8]         N. Kalchbrenner, E. Grefenstette, and P. Blunsom, "A convolutional neural network for modelling sentences," in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2014, pp. 655–665.
  • [9]         H. Salehinejad, S. Sankar, J. Barfett, E. Colak, and S. Valaee, “Recent advances in recurrent neural networks,” arXiv preprint arXiv:1801.01078, 2017.
  • S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [11]      J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," in NIPS 2014 Workshop on Deep Learning, December 2014, 2014.
  • I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” Advances in neural information processing systems, vol. 27, 2014.
  • D. Bahdanau, K. H. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in 3rd International Conference on Learning Representations, ICLR 2015, 2015.
  • [14]      M.-T. Luong, H. Pham, and C. D. Manning, "Effective approaches to attention-based neural machine translation," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 1412–1421.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classi- fication with deep convolutional neural networks,” Advances in neural information processing systems, vol. 25, 2012.
  • K. Simonyan and A. Zisserman, “Very deep convolutional net- works for large-scale image recognition,” in 3rd International Con- ference on Learning Representations (ICLR 2015). Computational and Biological Learning Society, 2015.
  • C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.
  • J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre- training of deep bidirectional transformers for language under- standing,” arXiv preprint arXiv:1810.04805, 2018.
  • [20]      A. Radford, K. Narasimhan, T. Salimans, I. Sutskever et al., "Improving language understanding by generative pre-training."
  • [21]      X. Liu, F. Zhang, Z. Hou, L. Mian, Z. Wang, J. Zhang, and J. Tang, "Self-supervised learning: Generative or contrastive," IEEE transactions on knowledge and data engineering, vol. 35, no. 1, pp. 857–876, 2021.
  • [22]      J. Gui, T. Chen, Q. Cao, Z. Sun, H. Luo, and D. Tao, "A survey of self-supervised learning from multiple perspectives: Algorithms, theory, applications and future trends," arXiv preprint arXiv:2301.05712, 2023.
  • Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le, “Xlnet: Generalized autoregressive pretraining for language understanding,” Advances in neural information processing systems, vol. 32, 2019.
  • [24]      K. Clark, M.-T. Luong, Q. V. Le, and C. D. Manning, "Electra: Pre-training text encoders as discriminators rather than generators," in International Conference on Learning Representations, 2019.
  • Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, “Albert: A lite bert for self-supervised learning of language representations,” in International Conference on Learning Representations, 2019.
  • P. He, J. Gao, and W. Chen, “Debertav3: Improving deberta using electra-style pre-training with gradient-disentangled embedding sharing,” in The Eleventh International Conference on Learning Representations, 2022.
  • P. He, X. Liu, J. Gao, and W. Chen, “Deberta: Decoding-enhanced bert with disentangled attention,” in International Conference on Learning Representations, 2020.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019.
  • C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” The Journal of Machine Learning Research, vol. 21, no. 1, pp. 5485–5551, 2020.
  • [30]      M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, "Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension," in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 7871–7880.
  • [31]      A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann et al., “Palm: Scaling language modeling with pathways,” arXiv preprint arXiv:2204.02311, 2022.
  • [32]      J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. d. L. Casas, L. A. Hendricks, J. Welbl, A. Clark et al., “Training compute-optimal large language models,” arXiv preprint arXiv:2203.15556, 2022.
  • [33]      N. Du, Y. Huang, A. M. Dai, S. Tong, D. Lepikhin, Y. Xu, M. Krikun, Y. Zhou, A. W. Yu, O. Firat et al., "Glam: Efficient scaling of language models with mixture-of-experts," in International Conference on Machine Learning. PMLR, 2022, pp. 5547–5569.
  • [34]      R. Thoppilan, D. De Freitas, J. Hall, N. Shazeer, A. Kulshreshtha, H.-T. Cheng, A. Jin, T. Bos, L. Baker, Y. Du et al., "Lamda: Language models for dialog applications," arXiv preprint arXiv:2201.08239, 2022.
  • [35]      J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoffmann, F. Song, J. Aslanides, S. Henderson, R. Ring, S. Young et al., “Scaling language models: Methods, analysis & insights from training gopher,” arXiv preprint arXiv:2112.11446, 2021.
  • [36]      S. Smith, M. Patwary, B. Norick, P. LeGresley, S. Rajbhandari, J. Casper, Z. Liu, S. Prabhumoye, G. Zerveas, V. Korthikanti et al., “Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model,” arXiv preprint arXiv:2201.11990, 2022.
  • [37]      T. L. Scao, A. Fan, C. Akiki, E. Pavlick, S. Ilić, D. Hesslow, R. Castagné, A. S. Luccioni, F. Yvon, M. Gallé et al., "Bloom: A 176b-parameter open-access multilingual language model," arXiv preprint arXiv:2211.05100, 2022.
  • [38]      R. Taylor, M. Kardas, G. Cucurull, T. Scialom, A. Hartshorn, E. Saravia, A. Poulton, V. Kerkez, and R. Stojnic, "Galactica: A large language model for science," arXiv preprint arXiv:2211.09085, 2022.
  • S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin et al., “Opt: Open pre-trained transformer language models,” arXiv preprint arXiv:2205.01068, 2022.
  • H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozie`re, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.
  • [41]      H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023.
  • [42]      OpenAI, “Gpt-4 technical report,” 2023.
  • [43]      S. Bubeck, V. Chandrasekaran, R. Eldan, J. Gehrke, E. Horvitz, E. Kamar, P. Lee, Y. T. Lee, Y. Li, S. Lundberg et al., “Sparks of artificial general intelligence: Early experiments with gpt-4,” arXiv preprint arXiv:2303.12712, 2023.
  • [44]      W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong et al., “A survey of large language models,” arXiv preprint arXiv:2303.18223, 2023.
  • [45]      Q. Dong, L. Li, D. Dai, C. Zheng, Z. Wu, B. Chang, X. Sun, J. Xu, and Z. Sui, “A survey for in-context learning,” arXiv preprint arXiv:2301.00234, 2022.
  • Y. Chang, X. Wang, J. Wang, Y. Wu, K. Zhu, H. Chen, L. Yang, X. Yi, C. Wang, Y. Wang et al., “A survey on evaluation of large language models,” arXiv preprint arXiv:2307.03109, 2023.
  • [47]      Z. Zhuang, Q. Chen, L. Ma, M. Li, Y. Han, Y. Qian, H. Bai, Z. Feng, W. Zhang, and T. Liu, “Through the lens of core competency: Survey on evaluation of large language models,” arXiv preprint arXiv:2308.07902, 2023.
  • [48]      Y. Wang, W. Zhong, L. Li, F. Mi, X. Zeng, W. Huang, L. Shang, X. Jiang, and Q. Liu, “Aligning large language models with human: A survey,” arXiv preprint arXiv:2307.12966, 2023.
  • [49]      Y. Liu, Y. Yao, J.-F. Ton, X. Zhang, R. G. H. Cheng, Y. Klochkov, M. F. Taufiq, and H. Li, "Trustworthy llms: a survey and guideline for evaluating large language models' alignment," arXiv preprint arXiv:2308.05374, 2023.
  • [50]      X. Huang, W. Ruan, W. Huang, G. Jin, Y. Dong, C. Wu, S. Bensalem, R. Mu, Y. Qi, X. Zhao et al., “A survey of safety and trustworthiness of large language models through the lens of verification and validation,” arXiv preprint arXiv:2305.11391, 2023.
  • [51]      J. Huang and K. C.-C. Chang, “Towards reasoning in large language models: A survey,” arXiv preprint arXiv:2212.10403, 2022.
  • J. Kaddour, J. Harris, M. Mozes, H. Bradley, R. Raileanu, and R. McHardy, “Challenges and applications of large language models,” arXiv preprint arXiv:2307.10169, 2023.
  • [53]      X. Zhu, J. Li, Y. Liu, C. Ma, and W. Wang, “A survey on model compression for large language models,” arXiv preprint arXiv:2308.07633, 2023.
  • S. Yin, C. Fu, S. Zhao, K. Li, X. Sun, T. Xu, and E. Chen, “A survey on multimodal large language models,” arXiv preprint arXiv:2306.13549, 2023.
• T. Young, D. Hazarika, S. Poria, and E. Cambria, “Recent trends in deep learning based natural language processing,” IEEE Computational Intelligence Magazine, vol. 13, no. 3, pp. 55–75, 2018.
• [56] Y. Kim, “Convolutional neural networks for sentence classification,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 2014.
• [57] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder–decoder for statistical machine translation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 2014, p. 1724.
  • [58]      F. Zhuang, Z. Qi, K. Duan, D. Xi, Y. Zhu, H. Zhu, H. Xiong, and Q. He, “A comprehensive survey on transfer learning,” Proceedings of the IEEE, vol. 109, no. 1, pp. 43–76, 2020.
  • [59]      D. W. Otter, J. R. Medina, and J. K. Kalita, “A survey of the usages of deep learning for natural language processing,” IEEE transactions on neural networks and learning systems, vol. 32, no. 2, pp. 604–624, 2020.
• [60] X. Han, Z. Zhang, N. Ding, Y. Gu, X. Liu, Y. Huo, J. Qiu, Y. Yao, A. Zhang, L. Zhang et al., “Pre-trained models: Past, present and future,” AI Open, vol. 2, pp. 225–250, 2021.
  • [61]      S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions on knowledge and data engineering, vol. 22, no. 10, pp. 1345–1359, 2009.
  • J. Blitzer, M. Dredze, and F. Pereira, “Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification,” in Proceedings of the 45th annual meeting of the association of computational linguistics, 2007, pp. 440–447.
  • [63]      J. E. Van Engelen and H. H. Hoos, “A survey on semi-supervised learning,” Machine learning, vol. 109, no. 2, pp. 373–440, 2020.
  • [64]      Y. Zhang and Q. Yang, “A survey on multi-task learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 34, no. 12, pp. 5586–5609, 2021.
  • [65]      J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition. Ieee, 2009, pp. 248–255.
  • [66]      M. Pagliardini, P. Gupta, and M. Jaggi, “Unsupervised learning of sentence embeddings using compositional n-gram features,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018, pp. 528–540.
  • [67]      M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, “Deep contextualized word representations,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). New Orleans, Louisiana: Association for Computational Linguistics, Jun. 2018, pp. 2227–2237. [Online]. Available: https://aclanthology.org/N18-1202
  • [68]      R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos, S. Shakeri, E. Taropa, P. Bailey, Z. Chen et al., “Palm 2 technical report,” arXiv preprint arXiv:2305.10403, 2023.
  • [69]      T. Lin, Y. Wang, X. Liu, and X. Qiu, “A survey of transformers,” AI Open, 2022.
  • [70]      X. Qiu, T. Sun, Y. Xu, Y. Shao, N. Dai, and X. Huang, “Pre-trained models for natural language processing: A survey,” Science China Technological Sciences, vol. 63, no. 10, pp. 1872–1897, 2020.
• [71] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.
  • [72]      J. Zhang, Y. Zhao, M. Saleh, and P. Liu, “Pegasus: Pre-training with extracted gap-sentences for abstractive summarization,” in International Conference on Machine Learning. PMLR, 2020, pp. 11 328–11 339.
  • S. Doddapaneni, G. Ramesh, M. M. Khapra, A. Kunchukuttan, and P. Kumar, “A primer on pretrained multilingual language models,” arXiv preprint arXiv:2107.00676, 2021.
• [74] L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, and C. Raffel, “mt5: A massively multilingual pre-trained text-to-text transformer,” in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021, pp. 483–498.
• Y. Liu, J. Gu, N. Goyal, X. Li, S. Edunov, M. Ghazvininejad, M. Lewis, and L. Zettlemoyer, “Multilingual denoising pre-training for neural machine translation,” Transactions of the Association for Computational Linguistics, vol. 8, pp. 726–742, 2020.
• [76] D. Kakwani, A. Kunchukuttan, S. Golla, N. Gokul, A. Bhattacharyya, M. M. Khapra, and P. Kumar, “Indicnlpsuite: Monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for indian languages,” in Findings of the Association for Computational Linguistics: EMNLP 2020, 2020, pp. 4948–4961.
  • A. Conneau and G. Lample, “Cross-lingual language model pretraining,” Advances in neural information processing systems, vol. 32, 2019.
• A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, É. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov, “Unsupervised cross-lingual representation learning at scale,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 8440–8451.
  • [79]      D. Q. Nguyen, T. Vu, and A. T. Nguyen, “Bertweet: A pre-trained language model for english tweets,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2020, pp. 9–14.
  • [80]      F. Barbieri, J. Camacho-Collados, L. E. Anke, and L. Neves, “Tweeteval: Unified benchmark and comparative evaluation for tweet classification,” in Findings of the Association for Computa- tional Linguistics: EMNLP 2020, 2020, pp. 1644–1650.
  • [81]      Y. Yang, M. C. S. Uy, and A. Huang, “Finbert: A pretrained language model for financial communications,” arXiv preprint arXiv:2006.08097, 2020.
  • D. Araci, “Finbert: Financial sentiment analysis with pre-trained language models,” arXiv preprint arXiv:1908.10063, 2019.
• [83] Z. Liu, D. Huang, K. Huang, Z. Li, and J. Zhao, “Finbert: A pre-trained financial language representation model for financial text mining,” in Proceedings of the twenty-ninth international conference on international joint conferences on artificial intelligence, 2021, pp. 4513–4519.
  • I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras, and I. Androutsopoulos, “Legal-bert: The muppets straight out of law school,” in Findings of the Association for Computational Lin- guistics: EMNLP 2020, 2020, pp. 2898–2904.
  • [85]      S. Leivaditi, J. Rossi, and E. Kanoulas, “A benchmark for lease contract review,” arXiv preprint arXiv:2010.10386, 2020.
• [86] Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang et al., “Codebert: A pre-trained model for programming and natural languages,” in Findings of the Association for Computational Linguistics: EMNLP 2020, 2020, pp. 1536–1547.
• Y. Wang, W. Wang, S. Joty, and S. C. Hoi, “Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 8696–8708.
  • Y. Wang, H. Le, A. D. Gotmare, N. D. Bui, J. Li, and S. C. Hoi, “Codet5+: Open code large language models for code understanding and generation,” arXiv preprint arXiv:2305.07922, 2023.
  • J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, and J. Kang, “Biobert: a pre-trained biomedical language representation model for biomedical text mining,” Bioinformatics, vol. 36, no. 4, pp. 1234–1240, 2020.
  • Y. Gu, R. Tinn, H. Cheng, M. Lucas, N. Usuyama, X. Liu, T. Naumann, J. Gao, and H. Poon, “Domain-specific language model pretraining for biomedical natural language processing,” arXiv preprint arXiv:2007.15779, 2020.
• K. raj Kanakarajan, B. Kundumani, and M. Sankarasubbu, “Bioelectra: pretrained biomedical text encoder using discriminators,” in Proceedings of the 20th Workshop on Biomedical Language Processing, 2021, pp. 143–154.
  • V. Sanh, L. Debut, J. Chaumond, and T. Wolf, “Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter,” arXiv preprint arXiv:1910.01108, 2019.
  • X. Jiao, Y. Yin, L. Shang, X. Jiang, X. Chen, L. Li, F. Wang, and Q. Liu, “Tinybert: Distilling bert for natural language understanding,” in Findings of the Association for Computational Linguistics: EMNLP 2020, 2020, pp. 4163–4174.
• Z. Sun, H. Yu, X. Song, R. Liu, Y. Yang, and D. Zhou, “Mobilebert: a compact task-agnostic bert for resource-limited devices,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 2158–2170.
  • W. Wang, F. Wei, L. Dong, H. Bao, N. Yang, and M. Zhou, “Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers,” arXiv preprint arXiv:2002.10957, 2020.
• I. Beltagy, M. E. Peters, and A. Cohan, “Longformer: The long-document transformer,” arXiv preprint arXiv:2004.05150, 2020.
  • M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang et al., “Big bird: Transformers for longer sequences.” in NeurIPS, 2020.
• F. Liu, E. Shareghi, Z. Meng, M. Basaldella, and N. Collier, “Self-alignment pretraining for biomedical entity representations,” in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021, pp. 4228–4238.
• G. Michalopoulos, Y. Wang, H. Kaka, H. Chen, and A. Wong, “Umlsbert: Clinical domain knowledge augmentation of contextual embeddings using the unified medical language system metathesaurus,” arXiv preprint arXiv:2010.10391, 2020.
  • B. Goertzel, “Artificial general intelligence: concept, state of the art, and future prospects,” Journal of Artificial General Intelligence, vol. 5, no. 1, p. 1, 2014.
  • J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou, D. Metzler et al., “Emergent abilities of large language models,” Transactions on Machine Learning Research, 2022.
• R. Schaeffer, B. Miranda, and S. Koyejo, “Are emergent abilities of large language models a mirage?” arXiv preprint arXiv:2304.15004, 2023.
• M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman et al., “Evaluating large language models trained on code,” arXiv preprint arXiv:2107.03374, 2021.
  • Y. Li, D. Choi, J. Chung, N. Kushman, J. Schrittwieser, R. Leblond, T. Eccles, J. Keeling, F. Gimeno, A. Dal Lago et al., “Competition-level code generation with alphacode,” Science, vol. 378, no. 6624, pp. 1092–1097, 2022.
  • A. Glaese, N. McAleese, M. Trebacz, J. Aslanides, V. Firoiu, T. Ewalds, M. Rauh, L. Weidinger, M. Chadwick, P. Thacker et al., “Improving alignment of dialogue agents via targeted human judgements,” arXiv preprint arXiv:2209.14375, 2022.
• S. Wang, Y. Sun, Y. Xiang, Z. Wu, S. Ding, W. Gong, S. Feng, J. Shang, Y. Zhao, C. Pang et al., “Ernie 3.0 titan: Exploring larger-scale knowledge enhanced pre-training for language understanding and generation,” arXiv preprint arXiv:2112.12731, 2021.
• O. Lieber, O. Sharir, B. Lenz, and Y. Shoham, “Jurassic-1: Technical details and evaluation,” White Paper. AI21 Labs, vol. 1, 2021.
• S. Soltan, S. Ananthakrishnan, J. FitzGerald, R. Gupta, W. Hamza, H. Khan, C. Peris, S. Rawls, A. Rosenbaum, A. Rumshisky et al., “Alexatm 20b: Few-shot learning using a large-scale multilingual seq2seq model,” arXiv preprint arXiv:2208.01448, 2022.
  • S. Iyer, X. V. Lin, R. Pasunuru, T. Mihaylov, D. Simig, P. Yu, K. Shuster, T. Wang, Q. Liu, P. S. Koura et al., “Opt-iml: Scaling language model instruction meta learning through the lens of generalization,” arXiv preprint arXiv:2212.12017, 2022.
• N. Muennighoff, T. Wang, L. Sutawika, A. Roberts, S. Biderman, T. L. Scao, M. S. Bari, S. Shen, Z.-X. Yong, H. Schoelkopf et al., “Crosslingual generalization through multitask finetuning,” arXiv preprint arXiv:2211.01786, 2022.
  • [111]   N. Sengupta, S. K. Sahu, B. Jia, S. Katipomu, H. Li, F. Koto, O. M. Afzal, S. Kamboj, O. Pandit, R. Pal et al., “Jais and jais-chat: Arabic-centric foundation and instruction-tuned open generative large language models,” arXiv preprint arXiv:2308.16149, 2023.
  • A. Zeng, X. Liu, Z. Du, Z. Wang, H. Lai, M. Ding, Z. Yang, Y. Xu, W. Zheng, X. Xia et al., “Glm-130b: An open bilingual pre-trained model,” in The Eleventh International Conference on Learning Representations, 2022.
• X. Li, Y. Yao, X. Jiang, X. Fang, X. Meng, S. Fan, P. Han, J. Li, L. Du, B. Qin et al., “Flm-101b: An open llm and how to train it with $100k budget,” arXiv preprint arXiv:2309.03852, 2023.
  • H. Yang, X.-Y. Liu, and C. D. Wang, “Fingpt: Open-source financial large language models,” arXiv preprint arXiv:2306.06031, 2023.
  • S. Wu, O. Irsoy, S. Lu, V. Dabravolski, M. Dredze, S. Gehrmann, P. Kambadur, D. Rosenberg, and G. Mann, “Bloomberggpt: A large language model for finance,” arXiv preprint arXiv:2303.17564, 2023.
  • K. Singhal, S. Azizi, T. Tu, S. S. Mahdavi, J. Wei, H. W. Chung, N. Scales, A. Tanwani, H. Cole-Lewis, S. Pfohl et al., “Large language models encode clinical knowledge,” Nature, pp. 1–9, 2023.
• K. Singhal, T. Tu, J. Gottweis, R. Sayres, E. Wulczyn, L. Hou, K. Clark, S. Pfohl, H. Cole-Lewis, D. Neal et al., “Towards expert-level medical question answering with large language models,” arXiv preprint arXiv:2305.09617, 2023.
  • R. Li, L. B. Allal, Y. Zi, N. Muennighoff, D. Kocetkov, C. Mou, M. Marone, C. Akiki, J. Li, J. Chim et al., “Starcoder: may the source be with you!” arXiv preprint arXiv:2305.06161, 2023.
• B. Rozière, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, T. Remez, J. Rapin et al., “Code llama: Open foundation models for code,” arXiv preprint arXiv:2308.12950, 2023.
  • E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang, Y. Zhou, S. Savarese, and C. Xiong, “Codegen: An open large language model for code with multi-turn program synthesis,” in The Eleventh International Conference on Learning Representations, 2022.
  • E. Nijkamp, H. Hayashi, C. Xiong, S. Savarese, and Y. Zhou, “Codegen2: Lessons for training llms on programming and natural languages,” arXiv preprint arXiv:2305.02309, 2023.
  • A. Radford, R. Jozefowicz, and I. Sutskever, “Learning to generate reviews and discovering sentiment,” arXiv preprint arXiv:1704.01444, 2017.
  • A. M. Dai and Q. V. Le, “Semi-supervised sequence learning,” Advances in neural information processing systems, vol. 28, 2015.
  • J. Howard and S. Ruder, “Universal language model fine-tuning for text classification,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 328–339.
• B. Zhang, X. Fu, D. Ding, H. Huang, Y. Li, and L. Jing, “Investigating chain-of-thought with chatgpt for stance detection on social media,” arXiv preprint arXiv:2304.03087, 2023.
  • B. Lamichhane, “Evaluation of chatgpt for nlp-based mental health applications,” arXiv preprint arXiv:2303.15727, 2023.
  • K. Yang, S. Ji, T. Zhang, Q. Xie, and S. Ananiadou, “On the evaluations of chatgpt and emotion-enhanced prompting for mental health analysis,” arXiv preprint arXiv:2304.03347, 2023.
  • Z. Wang, Q. Xie, Z. Ding, Y. Feng, and R. Xia, “Is chatgpt a good sentiment analyzer? a preliminary study,” arXiv preprint arXiv:2304.04339, 2023.
  • A. Lopez-Lira and Y. Tang, “Can chatgpt forecast stock price movements? return predictability and large language models,” arXiv preprint arXiv:2304.07619, 2023.
  • C. Ziems, W. Held, O. Shaikh, J. Chen, Z. Zhang, and D. Yang, “Can large language models transform computational social science?” arXiv preprint arXiv:2305.03514, 2023.
• T. Kuzman, N. Ljubešić, and I. Mozetič, “Chatgpt: Beginning of an end of manual annotation? use case of automatic genre identification,” arXiv preprint arXiv:2303.03953, 2023.
• Y. Bang, S. Cahyawijaya, N. Lee, W. Dai, D. Su, B. Wilie, H. Lovenia, Z. Ji, T. Yu, W. Chung et al., “A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity,” arXiv preprint arXiv:2302.04023, 2023.
• J. Kocoń, I. Cichecki, O. Kaszyca, M. Kochanek, D. Szydło, J. Baran, J. Bielaniewicz, M. Gruza, A. Janz, K. Kanclerz et al., “Chatgpt: Jack of all trades, master of none,” arXiv preprint arXiv:2302.10724, 2023.
  • [134]   Q. Zhong, L. Ding, J. Liu, B. Du, and D. Tao, “Can chatgpt understand too? a comparative study on chatgpt and fine-tuned bert,” arXiv preprint arXiv:2302.10198, 2023.
• [135] J. Ye, X. Chen, N. Xu, C. Zu, Z. Shao, S. Liu, Y. Cui, Z. Zhou, C. Gong, Y. Shen et al., “A comprehensive capability analysis of gpt-3 and gpt-3.5 series models,” arXiv preprint arXiv:2303.10420, 2023.
• X. Li, X. Zhu, Z. Ma, X. Liu, and S. Shah, “Are chatgpt and gpt-4 general-purpose solvers for financial text analytics? an examination on several typical tasks,” arXiv preprint arXiv:2305.05862, 2023.
  • Z. Wu, L. Zhang, C. Cao, X. Yu, H. Dai, C. Ma, Z. Liu, L. Zhao, G. Li, W. Liu et al., “Exploring the trade-offs: Unified large language models vs local fine-tuned models for highly-specific radiology nli task,” arXiv preprint arXiv:2304.09138, 2023.
  • [138]   Y. Wang, Y. Zhao, and L. Petzold, “Are large language models ready for healthcare? a comparative study on clinical language understanding,” arXiv preprint arXiv:2304.05368, 2023.
  • [139]   K.-L. Chiu, A. Collins, and R. Alexander, “Detecting hate speech with gpt-3,” arXiv preprint arXiv:2103.12407, 2021.
  • [140]   F. Huang, H. Kwak, and J. An, “Is chatgpt better than human annotators? potential and limitations of chatgpt in explaining implicit hate speech,” arXiv preprint arXiv:2302.07736, 2023.
• [141] S. Chen, Y. Li, S. Lu, H. Van, H. J. Aerts, G. K. Savova, and D. S. Bitterman, “Evaluation of chatgpt family of models for biomedical reasoning and classification,” arXiv preprint arXiv:2304.02496, 2023.
• M. M. Amin, E. Cambria, and B. W. Schuller, “Will affective computing emerge from foundation models and general ai? a first evaluation on chatgpt,” IEEE Intelligent Systems, vol. 38, no. 2, 2023.
  • [143]   S. Parikh, Q. Vohra, P. Tumbade, and M. Tiwari, “Exploring zero and few-shot techniques for intent classification,” arXiv preprint arXiv:2305.07157, 2023.
  • X. Sun, X. Li, J. Li, F. Wu, S. Guo, T. Zhang, and G. Wang, “Text classification via large language models,” arXiv preprint arXiv:2305.08377, 2023.
  • Q. Li, H. Peng, J. Li, C. Xia, R. Yang, L. Sun, P. S. Yu, and L. He, “A survey on text classification: From traditional to deep learning,” ACM Transactions on Intelligent Systems and Technology (TIST), vol. 13, no. 2, pp. 1–41, 2022.
• [146] C.-E. González-Gallardo, E. Boros, N. Girdhar, A. Hamdi, J. G. Moreno, and A. Doucet, “Yes but.. can chatgpt identify entities in historical documents?” arXiv preprint arXiv:2303.17322, 2023.
  • [147]   Y. Hu, I. Ameer, X. Zuo, X. Peng, Y. Zhou, Z. Li, Y. Li, J. Li, X. Jiang, and H. Xu, “Zero-shot clinical entity recognition using chatgpt,” arXiv preprint arXiv:2303.16416, 2023.
  • [148]   X. Wei, X. Cui, N. Cheng, X. Wang, X. Zhang, S. Huang, P. Xie, J. Xu, Y. Chen, M. Zhang et al., “Zero-shot information extraction via chatting with chatgpt,” arXiv preprint arXiv:2302.10205, 2023.
• [149] B. J. Gutiérrez, N. McNeal, C. Washington, Y. Chen, L. Li, H. Sun, and Y. Su, “Thinking about gpt-3 in-context learning for biomedical ie? think again,” in Findings of the Association for Computational Linguistics: EMNLP 2022, 2022, pp. 4497–4512.
  • [150]   J. Gao, H. Zhao, C. Yu, and R. Xu, “Exploring the feasibility of chatgpt for event extraction,” arXiv preprint arXiv:2303.03836, 2023.
• H. Rehana, N. B. Çam, M. Basmaci, Y. He, A. Özgür, and J. Hur, “Evaluation of gpt and bert-based models on identifying protein-protein interactions in biomedical text,” arXiv preprint arXiv:2303.17728, 2023.
  • C. Yuan, Q. Xie, and S. Ananiadou, “Zero-shot temporal relation extraction with chatgpt,” arXiv preprint arXiv:2304.05454, 2023.
• [153] B. Li, G. Fang, Y. Yang, Q. Wang, W. Ye, W. Zhao, and S. Zhang, “Evaluating chatgpt’s information extraction capabilities: An assessment of performance, explainability, calibration, and faithfulness,” arXiv preprint arXiv:2304.11633, 2023.
  • [154]   C. Chan, J. Cheng, W. Wang, Y. Jiang, T. Fang, X. Liu, and Y. Song, “Chatgpt evaluation on sentence level relations: A focus on temporal, causal, and discourse relations,” arXiv preprint arXiv:2304.14827, 2023.
• X. Xu, Y. Zhu, X. Wang, and N. Zhang, “How to unleash the power of large language models for few-shot relation extraction?” arXiv preprint arXiv:2305.01555, 2023.
• [156] Z. Wan, F. Cheng, Z. Mao, Q. Liu, H. Song, J. Li, and S. Kurohashi, “Gpt-re: In-context learning for relation extraction using large language models,” arXiv preprint arXiv:2305.02105, 2023.
  • [157]   C. Qin, A. Zhang, Z. Zhang, J. Chen, M. Yasunaga, and D. Yang, “Is chatgpt a general-purpose natural language processing task solver?” arXiv preprint arXiv:2302.06476, 2023.
  • [158]   Y. Ma, Y. Cao, Y. Hong, and A. Sun, “Large language model is not a good few-shot information extractor, but a good reranker for hard samples!” arXiv preprint arXiv:2303.08559, 2023.
  • [159]   S. Wang, X. Sun, X. Li, R. Ouyang, F. Wu, T. Zhang, J. Li, and G. Wang, “Gpt-ner: Named entity recognition via large language models,” arXiv preprint arXiv:2304.10428, 2023.
  • [160]   D. Stammbach, M. Antoniak, and E. Ash, “Heroes, villains, and victims, and gpt-3: Automated extraction of character roles without training data,” in Proceedings of the 4th Workshop of Narrative Understanding (WNU2022), 2022, pp. 47–56.
  • [161]   S. Wadhwa, S. Amir, and B. C. Wallace, “Revisiting relation extraction in the era of large language models,” arXiv preprint arXiv:2305.05003, 2023.
• P. Li, T. Sun, Q. Tang, H. Yan, Y. Wu, X. Huang, and X. Qiu, “Codeie: Large code generation models are better few-shot information extractors,” arXiv preprint arXiv:2305.05711, 2023.
• [163] K. Zhang, B. J. Gutiérrez, and Y. Su, “Aligning instruction tasks unlocks large language models as zero-shot relation extractors,” arXiv preprint arXiv:2305.11159, 2023.
  • [164]   Y. Lu, Q. Liu, D. Dai, X. Xiao, H. Lin, X. Han, L. Sun, and H. Wu, “Unified structure generation for universal information extraction,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 5755–5772.
  • [165]   Y. Chen, J. Cheng, H. Jiang, L. Liu, H. Zhang, S. Shi, and R. Xu, “Learning from sibling mentions with scalable graph inference in fine-grained entity typing,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 2076–2087.
• [166] S. S. S. Das, A. Katiyar, R. J. Passonneau, and R. Zhang, “Container: Few-shot named entity recognition via contrastive learning,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 6338–6353.
  • S. Wu and Y. He, “Enriching pre-trained language model with entity information for relation classification,” in Proceedings of the 28th ACM international conference on information and knowledge management, 2019, pp. 2361–2364.
  • [168]   D. Ye, Y. Lin, P. Li, and M. Sun, “Packed levitated marker for entity and relation extraction,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 4904–4917.
• [169] K. Zhao, X. Jin, L. Bai, J. Guo, and X. Cheng, “Knowledge-enhanced self-supervised prototypical network for few-shot event detection,” in Findings of the Association for Computational Linguistics: EMNLP 2022, 2022, pp. 6266–6275.
  • [170]   Y. Ma, Z. Wang, Y. Cao, M. Li, M. Chen, K. Wang, and J. Shao, “Prompt for extraction? paie: Prompting argument interaction for event argument extraction,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 6759–6774.
• [171] X. Du and C. Cardie, “Event extraction by answering (almost) natural questions,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 671–683.
• Z. Zhao, E. Wallace, S. Feng, D. Klein, and S. Singh, “Calibrate before use: Improving few-shot performance of language models,” in International Conference on Machine Learning. PMLR, 2021, pp. 12697–12706.
  • [173]   Y. Wang, S. Mishra, P. Alipoormolabashi, Y. Kordi, A. Mirzaei, A. Naik, A. Ashok, A. S. Dhanasekaran, A. Arunkumar, D. Stap et al., “Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 5085–5109.
  • [174]   M. Zaib, W. E. Zhang, Q. Z. Sheng, A. Mahmood, and Y. Zhang, “Conversational question answering: A survey,” Knowledge and Information Systems, vol. 64, no. 12, pp. 3151–3195, 2022.
  • [175]   Y. Chali, S. A. Hasan, and S. R. Joty, “Improving graph-based random walks for complex question answering using syntactic, shallow semantic and extended string subsequence kernels,” Information Processing & Management, vol. 47, no. 6, pp. 843–855, 2011.
  • [176]   A. Torfi, R. A. Shirvani, Y. Keneshloo, N. Tavaf, and E. A. Fox, “Natural language processing advancements by deep learning: A survey,” arXiv preprint arXiv:2003.01200, 2020.
  • D. Nunes, R. Primi, R. Pires, R. Lotufo, and R. Nogueira, “Evaluating gpt-3.5 and gpt-4 models on brazilian university admission exams,” arXiv preprint arXiv:2303.17003, 2023.
• Y. Tan, D. Min, Y. Li, W. Li, N. Hu, Y. Chen, and G. Qi, “Evaluation of chatgpt as a question answering system for answering complex questions,” arXiv preprint arXiv:2303.07992, 2023.
  • Z. Yang, Z. Gan, J. Wang, X. Hu, Y. Lu, Z. Liu, and L. Wang, “An empirical study of gpt-3 for few-shot knowledge-based vqa,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 3, 2022, pp. 3081–3089.
  • P. Srivastava, T. Ganu, and S. Guha, “Towards zero-shot and few-shot table question answering using gpt-3,” arXiv preprint arXiv:2210.17284, 2022.
  • S. Zheng, J. Huang, and K. C.-C. Chang, “Why does chatgpt fall short in answering questions faithfully?” arXiv preprint arXiv:2304.10513, 2023.
  • J. S. Samaan, Y. H. Yeo, N. Rajeev, L. Hawley, S. Abel, W. H. Ng, N. Srinivasan, J. Park, M. Burch, R. Watson et al., “Assessing the accuracy of responses by the language model chatgpt to questions regarding bariatric surgery,” Obesity surgery, pp. 1–7, 2023.
• J. Holmes, Z. Liu, L. Zhang, Y. Ding, T. T. Sio, L. A. McGee, J. B. Ashman, X. Li, T. Liu, J. Shen et al., “Evaluating large language models on a highly-specialized topic, radiation oncology physics,” Frontiers in Oncology, vol. 13, p. 1219326.
• I. Joshi, R. Budhiraja, H. Dev, J. Kadia, M. O. Ataullah, S. Mitra, D. Kumar, and H. D. Akolekar, “Chatgpt–a blessing or a curse for undergraduate computer science students and instructors?” arXiv preprint arXiv:2304.14993, 2023.
  • H. Nori, N. King, S. M. McKinney, D. Carignan, and E. Horvitz, “Capabilities of gpt-4 on medical challenge problems,” arXiv preprint arXiv:2303.13375, 2023.
• A. Hamidi and K. Roberts, “Evaluation of ai chatbots for patient-specific ehr questions,” arXiv preprint arXiv:2306.02549, 2023.
  • J. Savelka, A. Agarwal, C. Bogart, and M. Sakr, “Large language models (gpt) struggle to answer multiple-choice questions about code,” arXiv preprint arXiv:2303.08033, 2023.
  • M. Bommarito II and D. M. Katz, “Gpt takes the bar exam,” arXiv preprint arXiv:2212.14402, 2022.
• J. Pereira, R. Fidalgo, R. Lotufo, and R. Nogueira, “Visconde: Multi-document qa with gpt-3 and neural reranking,” in European Conference on Information Retrieval. Springer, 2023, pp. 534–543.
  • R. Gupta, I. Herzog, J. B. Park, J. Weisberger, P. Firouzbakht, V. Ocon, J. Chao, E. S. Lee, and B. A. Mailey, “Performance of chatgpt on the plastic surgery inservice training examination,” Aesthetic surgery journal, p. sjad128, 2023.
• Y. Tanaka, T. Nakata, K. Aiga, T. Etani, R. Muramatsu, S. Katagiri, H. Kawai, F. Higashino, M. Enomoto, M. Noda et al., “Performance of generative pretrained transformer on the national medical licensing examination in japan,” medRxiv, pp. 2023–04, 2023.
• J. Robinson and D. Wingate, “Leveraging large language models for multiple choice question answering,” in The Eleventh International Conference on Learning Representations, 2022.
  • Y. Weng, B. Li, F. Xia, M. Zhu, B. Sun, S. He, K. Liu, and J. Zhao, “Large language models need holistically thought in medical conversational qa,” arXiv preprint arXiv:2305.05410, 2023.
  • S. Lin, J. Hilton, and O. Evans, “Truthfulqa: Measuring how models mimic human falsehoods,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 3214–3252.
  • J. Kasai, Y. Kasai, K. Sakaguchi, Y. Yamada, and D. Radev, “Evaluating gpt-4 and chatgpt on japanese medical licensing examinations,” arXiv preprint arXiv:2303.18027, 2023.
• W. Gu, “Linguistically informed chatgpt prompts to enhance japanese-chinese machine translation: A case study on attributive clauses,” arXiv preprint arXiv:2303.15587, 2023.
  • K. Peng, L. Ding, Q. Zhong, L. Shen, X. Liu, M. Zhang, Y. Ouyang, and D. Tao, “Towards making the most of chatgpt for machine translation,” arXiv preprint arXiv:2303.13780, 2023.
  • W. Jiao, W. Wang, J. Huang, X. Wang, and Z. Tu, “Is chatgpt a good translator? yes with gpt-4 as the engine,” arXiv preprint arXiv:2301.08745, 2023.
  • [199]   A. Hendy, M. Abdelrehim, A. Sharaf, V. Raunak, M. Gabr, H. Matsushita, Y. J. Kim, M. Afify, and H. H. Awadalla, “How good are gpt models at machine translation? a comprehensive evaluation,” arXiv preprint arXiv:2302.09210, 2023.
  • [200]   Y. Gao, R. Wang, and F. Hou, “How to design translation prompts for chatgpt: An empirical study,” arXiv e-prints, pp. arXiv–2304, 2023.
• L. Wang, C. Lyu, T. Ji, Z. Zhang, D. Yu, S. Shi, and Z. Tu, “Document-level machine translation with large language models,” arXiv preprint arXiv:2304.02210, 2023.
  • [202]   W. Zhu, H. Liu, Q. Dong, J. Xu, L. Kong, J. Chen, L. Li, and S. Huang, “Multilingual machine translation with large language models: Empirical results and analysis,” arXiv preprint arXiv:2304.04675, 2023.
  • C. Lyu, J. Xu, and L. Wang, “New trends in machine translation using large language models: Case examples with chatgpt,” arXiv preprint arXiv:2305.01181, 2023.
  • [204]   M. Karpinska and M. Iyyer, “Large language models effectively leverage document-level context for literary translation, but critical errors persist,” arXiv preprint arXiv:2304.03245, 2023.
  • [205]   Y. Moslem, R. Haque, and A. Way, “Adaptive machine translation with large language models,” arXiv preprint arXiv:2301.13294, 2023.
  • Z. He, T. Liang, W. Jiao, Z. Zhang, Y. Yang, R. Wang, Z. Tu, S. Shi, and X. Wang, “Exploring human-like translation strategy with large language models,” arXiv preprint arXiv:2305.04118, 2023.
• [207] V. Raunak, A. Sharaf, H. H. Awadallah, and A. Menezes, “Leveraging gpt-4 for automatic translation post-editing,” arXiv preprint arXiv:2305.14878, 2023.
  • V. Raunak, A. Menezes, M. Post, and H. H. Awadallah, “Do gpts produce less literal translations?” arXiv preprint arXiv:2305.16806, 2023.
  • F. Stahlberg, “Neural machine translation: A review,” Journal of Artificial Intelligence Research, vol. 69, pp. 343–418, 2020.
• S. Yang, Y. Wang, and X. Chu, “A survey of deep learning techniques for neural machine translation,” ArXiv, vol. abs/2002.07526, 2020.
  • Z. Tan, S. Wang, Z. Yang, G. Chen, X. Huang, M. Sun, and Y. Liu, “Neural machine translation: A review of methods, resources, and tools,” AI Open, vol. 1, pp. 5–21, 2020.
• [212] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” CoRR, vol. abs/1409.0473, 2014.
• Y. Tang, C. Tran, X. Li, P.-J. Chen, N. Goyal, V. Chaudhary, J. Gu, and A. Fan, “Multilingual translation with extensible multilingual pretraining and finetuning,” arXiv preprint arXiv:2008.00401, 2020.
• A. Fan, S. Bhosale, H. Schwenk, Z. Ma, A. El-Kishky, S. Goyal, M. Baines, O. Çelebi, G. Wenzek, V. Chaudhary, N. Goyal, T. Birch, V. Liptchinsky, S. Edunov, E. Grave, M. Auli, and A. Joulin, “Beyond english-centric multilingual machine translation,” ArXiv, vol. abs/2010.11125, 2020.
• M. R. Costa-jussà, J. Cross, O. Çelebi, M. Elbayad, K. Heafield, K. Heffernan, E. Kalbassi, J. Lam, D. Licht, J. Maillard et al., “No language left behind: Scaling human-centered machine translation,” arXiv preprint arXiv:2207.04672, 2022.
• [216] R. Martínez-Cruz, A. J. López-López, and J. Portela, “Chatgpt vs state-of-the-art models: A benchmarking study in keyphrase generation task,” arXiv preprint arXiv:2304.14177, 2023.
  • [217]   M. Song, H. Jiang, S. Shi, S. Yao, S. Lu, Y. Feng, H. Liu, and L. Jing, “Is chatgpt a good keyphrase generator? a preliminary study,” arXiv preprint arXiv:2303.13001, 2023.
  • [218]   W. Pan, Q. Chen, X. Xu, W. Che, and L. Qin, “A preliminary evaluation of chatgpt for zero-shot dialogue understanding,” arXiv preprint arXiv:2304.04256, 2023.
  • [219]   W. Zhao, Y. Zhao, X. Lu, S. Wang, Y. Tong, and B. Qin, “Is chatgpt equipped with emotional dialogue capabilities?” arXiv preprint arXiv:2304.09582, 2023.
  • B. Chintagunta, N. Katariya, X. Amatriain, and A. Kannan, “Medically aware gpt-3 as a data generator for medical dialogue summarization,” in Machine Learning for Healthcare Conference. PMLR, 2021, pp. 354–372.
  • [221]   G. P. Prodan and E. Pelican, “Prompt scoring system for dialogue summarization using gpt-3,” ACM Transaction on Audio, Speech, and Language Processing, pp. 1–9, 2022.
  • [222]   J. Huynh, C. Jiao, P. Gupta, S. Mehri, P. Bajaj, V. Chaudhary, and M. Eskenazi, “Understanding the effectiveness of very large language models on dialog evaluation,” arXiv preprint arXiv:2301.12004, 2023.
• Y. Fan and F. Jiang, “Uncovering the potential of chatgpt for discourse analysis in dialogue: An empirical study,” arXiv preprint arXiv:2305.08391, 2023.
• H. Wang, R. Wang, F. Mi, Z. Wang, R. Xu, and K.-F. Wong, “Chain-of-thought prompting for responding to in-depth dialogue questions with llm,” arXiv preprint arXiv:2305.11792, 2023.
• [225] R. Meng, X. Yuan, T. Wang, S. Zhao, A. Trischler, and D. He, “An empirical study on neural keyphrase generation,” in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021, pp. 4985–5007.
  • [226]   X. Yuan, T. Wang, R. Meng, K. Thaker, P. Brusilovsky, D. He, and A. Trischler, “One size does not fit all: Generating and evaluating variable number of keyphrases,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 7961–7975.
• M. Kulkarni, D. Mahata, R. Arora, and R. Bhowmik, “Learning rich representation of keyphrases from text,” in Findings of the Association for Computational Linguistics: NAACL 2022. Seattle, United States: Association for Computational Linguistics, Jul. 2022, pp. 891–906. [Online]. Available: https://aclanthology.org/2022.findings-naacl.67
  • [228]   I. V. Serban, R. Lowe, P. Henderson, L. Charlin, and J. Pineau, “A survey of available corpora for building data-driven dialogue systems: The journal version,” Dialogue & Discourse, vol. 9, no. 1, pp. 1–49, 2018.
  • [229]   S. Larson and K. Leach, “A survey of intent classification and slot-filling datasets for task-oriented dialog,” arXiv preprint arXiv:2207.13211, 2022.
• W. Sun, L. Yan, X. Ma, P. Ren, D. Yin, and Z. Ren, “Is chatgpt good at search? investigating large language models as re-ranking agent,” arXiv preprint arXiv:2304.09542, 2023.
  • [231]   N. Ziems, W. Yu, Z. Zhang, and M. Jiang, “Large language models are built-in autoregressive search engines,” arXiv preprint arXiv:2305.09612, 2023.
  • A. Anand, L. Lyu, M. Idahl, Y. Wang, J. Wallat, and Z. Zhang, “Explainable information retrieval: A survey,” arXiv preprint arXiv:2211.02405, 2022.
  • R. Nogueira, Z. Jiang, R. Pradeep, and J. Lin, “Document ranking with a pretrained sequence-to-sequence model,” in Findings of the Association for Computational Linguistics: EMNLP 2020, 2020, pp. 708–718.
• N. Thakur, N. Reimers, A. Rücklé, A. Srivastava, and I. Gurevych, “Beir: A heterogeneous benchmark for zero-shot evaluation of information retrieval models,” in Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021.
  • W. X. Zhao, J. Liu, R. Ren, and J.-R. Wen, “Dense text retrieval based on pretrained language models: A survey,” arXiv preprint arXiv:2211.14876, 2022.
  • G. Adomavicius and A. Tuzhilin, “Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions,” IEEE transactions on knowledge and data engineering, vol. 17, no. 6, pp. 734–749, 2005.
  • [237]   Y. Peng, “A survey on modern recommendation system based on big data,” arXiv preprint arXiv:2206.02631, 2022.
  • [238]   F. Rezaimehr and C. Dadkhah, “A survey of attack detection approaches in collaborative filtering recommender systems,” Artificial Intelligence Review, vol. 54, pp. 2011–2066, 2021.
  • [239]   Y. Xie, J. Gao, P. Zhou, Q. Ye, Y. Hua, J. Kim, F. Wu, and S. Kim, “Rethinking multi-interest learning for candidate matching in recommender systems,” arXiv preprint arXiv:2302.14532, 2023.
• [240] M. Dong, X. Zeng, L. Koehl, and J. Zhang, “An interactive knowledge-based recommender system for fashion product design in the big data environment,” Information Sciences, vol. 540, pp. 469–488, 2020.
  • [241]   Y. Gao, T. Sheng, Y. Xiang, Y. Xiong, H. Wang, and J. Zhang, “Chat-rec: Towards interactive and explainable llms-augmented recommender system,” arXiv preprint arXiv:2303.14524, 2023.
• [242] F. Zhu, Y. Wang, C. Chen, J. Zhou, L. Li, and G. Liu, “Cross-domain recommendation: challenges, progress, and prospects,” arXiv preprint arXiv:2103.01696, 2021.
• [243] L. Wang and E.-P. Lim, “Zero-shot next-item recommendation using large pretrained language models,” arXiv preprint arXiv:2304.03153, 2023.
  • A. Zhiyuli, Y. Chen, X. Zhang, and X. Liang, “Bookgpt: A general framework for book recommendation empowered by large language model,” arXiv preprint arXiv:2305.15673, 2023.
  • [245]   J. Liu, C. Liu, R. Lv, K. Zhou, and Y. Zhang, “Is chatgpt a good recommender? a preliminary study,” arXiv preprint arXiv:2304.10149, 2023.
  • S. Dai, N. Shao, H. Zhao, W. Yu, Z. Si, C. Xu, Z. Sun, X. Zhang, and J. Xu, “Uncovering chatgpt’s capabilities in recommender systems,” arXiv preprint arXiv:2305.02182, 2023.
• [247] W.-C. Kang, J. Ni, N. Mehta, M. Sathiamoorthy, L. Hong, E. Chi, and D. Z. Cheng, “Do llms understand user preferences? evaluating llms on user rating prediction,” arXiv preprint arXiv:2305.06474, 2023.
• J. Zhang, K. Bao, Y. Zhang, W. Wang, F. Feng, and X. He, “Is chatgpt fair for recommendation? evaluating fairness in large language model recommendation,” arXiv preprint arXiv:2305.07609, 2023.
  • Y. Hou, J. Zhang, Z. Lin, H. Lu, R. Xie, J. McAuley, and W. X. Zhao, “Large language models are zero-shot rankers for recommender systems,” arXiv preprint arXiv:2305.08845, 2023.
  • [250]   S. Mysore, A. McCallum, and H. Zamani, “Large language model augmented narrative driven recommendations,” arXiv preprint arXiv:2306.02250, 2023.
• [251] C. S. Xia and L. Zhang, “Keep the conversation going: Fixing 162 out of 337 bugs for $0.42 each using chatgpt,” arXiv preprint arXiv:2304.00385, 2023.
  • A. Cheshkov, P. Zadorozhny, and R. Levichev, “Evaluation of chatgpt model for vulnerability detection,” arXiv preprint arXiv:2304.07232, 2023.
• B. Yetiştiren, I. Özsoy, M. Ayerdem, and E. Tüzün, “Evaluating the code quality of ai-assisted code generation tools: An empirical study on github copilot, amazon codewhisperer, and chatgpt,” arXiv preprint arXiv:2304.10778, 2023.
  • [254]   T.-O. Li, W. Zong, Y. Wang, H. Tian, Y. Wang, and S.-C. Cheung, “Finding failure-inducing test cases with chatgpt,” arXiv preprint arXiv:2304.11686, 2023.
  • C. Liu, X. Bao, H. Zhang, N. Zhang, H. Hu, X. Zhang, and M. Yan, “Improving chatgpt prompt for code generation,” arXiv preprint arXiv:2305.08360, 2023.
• [256] R. A. Poldrack, T. Lu, and G. Beguš, “Ai-assisted coding: Experiments with gpt-4,” arXiv preprint arXiv:2304.13187, 2023.
  • [257]   J. Liu, C. S. Xia, Y. Wang, and L. Zhang, “Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation,” arXiv preprint arXiv:2305.01210, 2023.
• E. Chen, R. Huang, H.-S. Chen, Y.-H. Tseng, and L.-Y. Li, “Gptutor: a chatgpt-powered programming tool for code explanation,” arXiv preprint arXiv:2305.01863, 2023.
  • [259]   N. Nascimento, P. Alencar, and D. Cowan, “Comparing software developers with chatgpt: An empirical investigation,” arXiv preprint arXiv:2305.11837, 2023.
  • [260]   J. Y. Khan and G. Uddin, “Automatic code documentation generation using gpt-3,” in Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, 2022, pp. 1–6.
  • [261]   J. Leinonen, P. Denny, S. MacNeil, S. Sarsa, S. Bernstein, J. Kim, A. Tran, and A. Hellas, “Comparing code explanations created by students and large language models,” arXiv preprint arXiv:2304.03938, 2023.
• X.-Y. Li, J.-T. Xue, Z. Xie, and M. Li, “Think outside the code: Brainstorming boosts large language models in code generation,” arXiv preprint arXiv:2305.10679, 2023.
  • [263]   J. A. Prenner and R. Robbes, “Automatic program repair with openai’s codex: Evaluating quixbugs,” arXiv preprint arXiv:2111.03922, 2021.
  • M. L. Siddiq, J. C. S. Santos, R. H. Tanvir, N. Ulfat, F. A. Rifat, and V. C. Lopes, “Exploring the effectiveness of large language models in generating unit tests,” ArXiv, vol. abs/2305.00418, 2023.
• H. Tian, W. Lu, T. O. Li, X. Tang, S.-C. Cheung, J. Klein, and T. F. Bissyandé, “Is chatgpt the ultimate programming assistant–how far is it?” arXiv preprint arXiv:2304.11938, 2023.
  • [266] M. Geng, S. Wang, D. Dong, H. Wang, G. Li, Z. Jin, X. Mao, and X. Liao, “An empirical study on using large language models for multi-intent comment generation,” ArXiv, vol. abs/2304.11384, 2023.
• [267] S. Kang, B. Chen, S. Yoo, and J.-G. Lou, “Explainable automated debugging via large language model-driven scientific debugging,” arXiv preprint arXiv:2304.02195, 2023.
  • [268] A. Kashefi and T. Mukerji, “Chatgpt for programming numerical methods,” ArXiv, vol. abs/2303.12093, 2023.
  • [269] G. Destefanis, S. Bartolucci, and M. Ortu, “A preliminary analysis on the code generation capabilities of gpt-3.5 and bard ai models for java functions,” arXiv preprint arXiv:2305.09402, 2023.
  • [270] Z. Yuan, Y. Lou, M. Liu, S. Ding, K. Wang, Y. Chen, and X. Peng, “No more manual tests? evaluating and improving chatgpt for unit test generation,” ArXiv, vol. abs/2305.04207, 2023.
• [271] T. Phung, V.-A. Padurean, J. P. Cambronero, S. Gulwani, T. Kohn, R. Majumdar, A. K. Singla, and G. Soares, “Generative ai for programming education: Benchmarking chatgpt, gpt-4, and human tutors,” ArXiv, vol. abs/2306.17156, 2023.
• [272] X. Hou, Y. Zhao, Y. Liu, Z. Yang, K. Wang, L. Li, X. Luo, D. Lo, J. Grundy, and H. Wang, “Large language models for software engineering: A systematic literature review,” arXiv preprint arXiv:2308.10620, 2023.
  • [273] S. Lu, D. Guo, S. Ren, J. Huang, A. Svyatkovskiy, A. Blanco, C. Clement, D. Drain, D. Jiang, D. Tang et al., “Codexglue: A machine learning benchmark dataset for code understanding and generation,” in Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021.
  • [274] L. Phan, H. Tran, D. Le, H. Nguyen, J. Annibal, A. Peltekian, and Y. Ye, “Cotext: Multi-task learning with code-text transformer,” in Proceedings of the 1st Workshop on Natural Language Processing for Programming (NLP4Prog 2021), 2021, pp. 40–47.
  • [275] D. Guo, S. Ren, S. Lu, Z. Feng, D. Tang, L. Shujie, L. Zhou, N. Duan, A. Svyatkovskiy, S. Fu et al., “Graphcodebert: Pre-training code representations with data flow,” in International Conference on Learning Representations, 2020.
  • [276] W. Ahmad, S. Chakraborty, B. Ray, and K.-W. Chang, “Unified pre-training for program understanding and generation,” in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021, pp. 2655–2668.
  • [277] D. Zan, B. Chen, D. Yang, Z. Lin, M. Kim, B. Guan, Y. Wang, W. Chen, and J.-G. Lou, “Cert: Continual pre-training on sketches for library-oriented code generation,” arXiv preprint arXiv:2206.06888, 2022.
• [278] R. Just, D. Jalali, and M. D. Ernst, “Defects4j: A database of existing faults to enable controlled testing studies for java programs,” in Proceedings of the 2014 international symposium on software testing and analysis, 2014, pp. 437–440.
  • [279] D. Lin, J. Koppel, A. Chen, and A. Solar-Lezama, “Quixbugs: A multi-lingual program repair benchmark set based on the quixey challenge,” in Proceedings Companion of the 2017 ACM SIGPLAN international conference on systems, programming, languages, and applications: software for humanity, 2017, pp. 55–56.
  • [280] A. Sundar and L. Heck, “Multimodal conversational ai: A survey of datasets and approaches,” in Proceedings of the 4th Workshop on NLP for Conversational AI, 2022, pp. 131–147.
  • [281] P. Xu, X. Zhu, and D. A. Clifton, “Multimodal learning with transformers: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
  • [282] Z. Shao, Z. Yu, M. Wang, and J. Yu, “Prompting large language models with answer heuristics for knowledge-based visual question answering,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 14 974–14 983.
  • [283] Y. Lin, Y. Xie, D. Chen, Y. Xu, C. Zhu, and L. Yuan, “Revive: Regional visual representation matters in knowledge-based visual question answering,” arXiv preprint arXiv:2206.01201, 2022.
  • [284] L. Gui, B. Wang, Q. Huang, A. G. Hauptmann, Y. Bisk, and J. Gao, “Kat: A knowledge augmented transformer for vision-and-language,” in Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022, pp. 956–968.
  • [285] Y. Lu, X. Yang, X. Li, X. E. Wang, and W. Y. Wang, “Llmscore: Unveiling the power of large language models in text-to-image synthesis evaluation,” arXiv preprint arXiv:2305.11116, 2023.
• [286] W. Zhu, X. Wang, Y. Lu, T.-J. Fu, X. E. Wang, M. Eckstein, and W. Y. Wang, “Collaborative generative ai: Integrating gpt-k for efficient editing in text-to-image generation,” arXiv preprint arXiv:2305.11317, 2023.
  • T. Zhang, Y. Zhang, V. Vineet, N. Joshi, and X. Wang, “Controllable text-to-image generation with gpt-4,” arXiv preprint arXiv:2305.18583, 2023.
  • S. Hong, J. Seo, S. Hong, H. Shin, and S. Kim, “Large language models are frame-level directors for zero-shot text-to-video generation,” arXiv preprint arXiv:2305.14330, 2023.
  • R. Huang, M. Li, D. Yang, J. Shi, X. Chang, Z. Ye, Y. Wu, Z. Hong, J. Huang, J. Liu et al., “Audiogpt: Understanding and generating speech, music, sound, and talking head,” arXiv preprint arXiv:2304.12995, 2023.
  • M. Ranjit, G. Ganapathy, R. Manuel, and T. Ganu, “Retrieval augmented chest x-ray report generation using openai gpt models,” arXiv preprint arXiv:2305.03660, 2023.
  • S. S. Kalakonda, S. Maheshwari, and R. K. Sarvadevabhatla, “Action-gpt: Leveraging large-scale language models for improved and generalized zero shot action generation,” arXiv preprint arXiv:2211.15603, 2022.
  • C. Wu, S. Yin, W. Qi, X. Wang, Z. Tang, and N. Duan, “Visual chatgpt: Talking, drawing and editing with visual foundation models,” arXiv preprint arXiv:2303.04671, 2023.
  • Z. Yang, L. Li, J. Wang, K. Lin, E. Azarnasab, F. Ahmed, Z. Liu, C. Liu, M. Zeng, and L. Wang, “Mm-react: Prompting chatgpt for multimodal reasoning and action,” arXiv preprint arXiv:2303.11381, 2023.
  • J. Li, H. Li, Z. Pan, and G. Pan, “Prompt chatgpt in mner: Improved multimodal named entity recognition method based on auxiliary refining knowledge from chatgpt,” arXiv preprint arXiv:2305.12212, 2023.
  • S. Hakimov and D. Schlangen, “Images in language space: Exploring the suitability of large language models for vision & language tasks,” arXiv preprint arXiv:2305.13782, 2023.
  • W. Feng, W. Zhu, T.-j. Fu, V. Jampani, A. Akula, X. He, S. Basu, X. E. Wang, and W. Y. Wang, “Layoutgpt: Compositional visual planning and generation with large language models,” arXiv preprint arXiv:2305.15393, 2023.
  • L. Fan, D. Krishnan, P. Isola, D. Katabi, and Y. Tian, “Improving clip training with language rewrites,” arXiv preprint arXiv:2305.20088, 2023.
  • C. Li, C. Wong, S. Zhang, N. Usuyama, H. Liu, J. Yang, T. Naumann, H. Poon, and J. Gao, “Llava-med: Training a large language-and-vision assistant for biomedicine in one day,” arXiv preprint arXiv:2306.00890, 2023.
  • A. Bhattacharya, Y. K. Singla, B. Krishnamurthy, R. R. Shah, and C. Chen, “A video is worth 4096 tokens: Verbalize story videos to understand them in zero shot,” arXiv preprint arXiv:2305.09758, 2023.
  • X. Mei, C. Meng, H. Liu, Q. Kong, T. Ko, C. Zhao, M. D. Plumbley, Y. Zou, and W. Wang, “Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research,” arXiv preprint arXiv:2303.17395, 2023.
  • D. Zhang, S. Li, X. Zhang, J. Zhan, P. Wang, Y. Zhou, and X. Qiu, “Speechgpt: Empowering large language models with intrinsic cross-modal conversational abilities,” arXiv preprint arXiv:2305.11000, 2023.
  • Z. Zhao, L. Guo, T. Yue, S. Chen, S. Shao, X. Zhu, Z. Yuan, and J. Liu, “Chatbridge: Bridging modalities with large language model as a language catalyst,” arXiv preprint arXiv:2305.16103, 2023.
  • M. Zheng, X. Su, S. You, F. Wang, C. Qian, C. Xu, and S. Albanie, “Can gpt-4 perform neural architecture search?” arXiv preprint arXiv:2304.10970, 2023.
  • Y. Shen, K. Song, X. Tan, D. Li, W. Lu, and Y. Zhuang, “Hugginggpt: Solving ai tasks with chatgpt and its friends in huggingface,” arXiv preprint arXiv:2303.17580, 2023.
  • L. Zhang, Y. Zhang, K. Ren, D. Li, and Y. Yang, “Mlcopilot: Unleashing the power of large language models in solving machine learning tasks,” arXiv preprint arXiv:2304.14979, 2023.
  • S. Zhang, C. Gong, L. Wu, X. Liu, and M. Zhou, “Automl-gpt: Automatic machine learning with gpt,” arXiv preprint arXiv:2305.02499, 2023.
  • F. Hutter, L. Kotthoff, and J. Vanschoren, Automated machine learning: methods, systems, challenges.  Springer Nature, 2019.
  • A. Olmo, S. Sreedharan, and S. Kambhampati, “Gpt3-to-plan: Extracting plans from text using gpt-3,” arXiv preprint arXiv:2106.07131, 2021.
  • [309]   B. Zhang and H. Soh, “Large language models as zero-shot human models for human-robot interaction,” arXiv preprint arXiv:2303.03548, 2023.
  • Y. Xie, C. Yu, T. Zhu, J. Bai, Z. Gong, and H. Soh, “Translating natural language to planning goals with large-language models,” arXiv preprint arXiv:2302.05128, 2023.
  • H. Hu, H. Lu, H. Zhang, W. Lam, and Y. Zhang, “Chain-of- symbol prompting elicits planning in large langauge models,” arXiv preprint arXiv:2305.10276, 2023.
  • K. Valmeekam, A. Olmo, S. Sreedharan, and S. Kambhampati, “Large language models still can’t plan (a benchmark for llms on planning and reasoning about change),” in NeurIPS 2022 Foundation Models for Decision Making Workshop, 2022.
• K. M. Collins, C. Wong, J. Feng, M. Wei, and J. B. Tenenbaum, “Structured, flexible, and robust: benchmarking and improving large language models towards more human-like behavior in out-of-distribution reasoning tasks,” arXiv preprint arXiv:2205.05718, 2022.
  • K. Mahowald, A. A. Ivanova, I. A. Blank, N. Kanwisher, J. B. Tenenbaum, and E. Fedorenko, “Dissociating language and thought in large language models: a cognitive perspective,” arXiv preprint arXiv:2301.06627, 2023.
  • K. S. Kalyan and S. Sangeetha, “Medical concept normalization in user-generated texts by learning target concept embeddings,” in Proceedings of the 11th International Workshop on Health Text Mining and Information Analysis, 2020, pp. 18–23.
  • ——, “Target concept guided medical concept normalization in noisy user-generated texts,” in Proceedings of Deep Learning Inside Out (DeeLIO): The First Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, 2020, pp. 64–73.
• J. Holmes, Z. Liu, L. Zhang, Y. Ding, T. T. Sio, L. A. McGee, J. B. Ashman, X. Li, T. Liu, J. Shen et al., “Evaluating large language models on a highly-specialized topic, radiation oncology physics,” arXiv preprint arXiv:2304.01938, 2023.
• Z. Liu, X. Yu, L. Zhang, Z. Wu, C. Cao, H. Dai, L. Zhao, W. Liu, D. Shen, Q. Li et al., “Deid-gpt: Zero-shot medical text de-identification by gpt-4,” arXiv preprint arXiv:2303.11032, 2023.
• J. Giorgi, A. Toma, R. Xie, S. Chen, K. An, G. Zheng, and B. Wang, “Wanglab at mediqa-chat 2023: Clinical note generation from doctor-patient conversations using large language models,” in Proceedings of the 5th Clinical Natural Language Processing Workshop, 2023, pp. 323–334.
  • H. Nori, N. King, S. M. McKinney, D. Carignan, and E. Horvitz, “Capabilities of gpt-4 on medical challenge problems,” ArXiv, vol. abs/2303.13375, 2023.
• Q. Chen, J. Du, Y. Hu, V. K. Keloth, X. Peng, K. Raja, R. Zhang, Z. Lu, and H. Xu, “Large language models in biomedical natural language processing: benchmarks, baselines, and recommendations,” arXiv preprint arXiv:2305.16326, 2023.
• Y. Tanaka, T. Nakata, K. Aiga, T. Etani, R. Muramatsu, S. Katagiri, H. Kawai, F. Higashino, M. Enomoto, M. Noda, M. Kometani, M. Takamura, T. Yoneda, H. Kakizaki, and A. Nomura, “Performance of generative pretrained transformer on the national medical licensing examination in japan,” in medRxiv, 2023.
  • J. Liu, P. Zhou, Y. Hua, D. Chong, Z. Tian, A. Liu, H. Wang, C. You, Z. Guo, L. Zhu et al., “Benchmarking large language models on cmexam–a comprehensive chinese medical exam dataset,” arXiv preprint arXiv:2306.03030, 2023.
  • Z. Yang, S. Cherian, and S. Vucetic, “Data augmentation for radiology report simplification,” in Findings of the Association for Computational Linguistics: EACL 2023, 2023, pp. 1877–1887.
  • C. Ma, Z. Wu, J. Wang, S. Xu, Y. Wei, Z. Liu, L. Guo, X. Cai, S. Zhang, T. Zhang et al., “Impressiongpt: an iterative optimizing framework for radiology report summarization with chatgpt,” arXiv preprint arXiv:2304.08448, 2023.
• M. Moradi, K. Blagec, F. Haberl, and M. Samwald, “Gpt-3 models are poor few-shot learners in the biomedical domain,” arXiv preprint arXiv:2109.02555, 2021.
• [329] M. Agrawal, S. Hegselmann, H. Lang, Y. Kim, and D. Sontag, “Large language models are few-shot clinical information extractors,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 1998–2022.
• [330] V. Nair, E. Schumacher, and A. Kannan, “Generating medically-accurate summaries of patient-provider dialogue: A multi-stage approach using large language models,” arXiv preprint arXiv:2305.05982, 2023.
  • C. Shaib, M. L. Li, S. Joseph, I. J. Marshall, J. J. Li, and B. C. Wallace, “Summarizing, simplifying, and synthesizing medical evidence using gpt-3 (with varying success),” arXiv preprint arXiv:2305.06299, 2023.
  • J. Xu, L. Lu, S. Yang, B. Liang, X. Peng, J. Pang, J. Ding, X. Shi, L. Yang, H. Song et al., “Medgpteval: A dataset and benchmark to evaluate responses of large language models in medicine,” arXiv preprint arXiv:2305.07340, 2023.
  • [333]   X. Wang, Z. Gong, G. Wang, J. Jia, Y. Xu, J. Zhao, Q. Fan, S. Wu, W. Hu, and X. Li, “Chatgpt performs on the chinese national medical licensing examination,” 2023.
  • [334]   K. A. Carpenter and R. B. Altman, “Using gpt-3 to build a lexicon of drugs of abuse synonyms for social media pharmacovigilance,” Biomolecules, vol. 13, no. 2, p. 387, 2023.
  • [335]   E. Hernandez, D. Mahajan, J. Wulff, M. J. Smith, Z. Ziegler, D. Nadler, P. Szolovits, A. Johnson, E. Alsentzer et al., “Do we still need clinical language models?” in Conference on Health, Inference, and Learning. PMLR, 2023, pp. 578–597.
  • [336]   A. S. Rao, M. Pang, J. Kim, M. Kamineni, W. Lie, A. K. Prasad, A. Landman, K. Dreyer, and M. D. Succi, “Assessing the utility of chatgpt throughout the entire clinical workflow,” medRxiv, pp. 2023–02, 2023.
  • T. H. Kung, M. Cheatham, A. Medenilla, C. Sillos, L. De Leon, C. Elepaño, M. Madriaga, R. Aggabao, G. Diaz-Candido, J. Maningo et al., “Performance of chatgpt on usmle: Potential for ai-assisted medical education using large language models,” PLoS digital health, vol. 2, no. 2, p. e0000198, 2023.
  • [338]   A. Hulman, O. L. Dollerup, J. F. Mortensen, M. Fenech, K. Norman, H. Stoevring, and T. K. Hansen, “Chatgpt versus human-generated answers to frequently asked questions about diabetes: a turing test-inspired survey among employees of a danish diabetes center,” medRxiv, pp. 2023–02, 2023.
  • [339]   T. Hirosawa, Y. Harada, M. Yokose, T. Sakamoto, R. Kawamura, and T. Shimizu, “Diagnostic accuracy of differential-diagnosis lists generated by generative pretrained transformer 3 chatbot for clinical vignettes with common chief complaints: A pilot study,” International journal of environmental research and public health, vol. 20, no. 4, p. 3378, 2023.
  • [340]   S. Liu, A. P. Wright, B. L. Patterson, J. P. Wanderer, R. W. Turer, S. D. Nelson, A. B. McCoy, D. F. Sittig, and A. Wright, “Assessing the value of chatgpt for clinical decision support optimization,” medRxiv, pp. 2023–02, 2023.
  • [341]   A. Gilson, C. W. Safranek, T. Huang, V. Socrates, L. Chi, R. A. Taylor, D. Chartash et al., “How does chatgpt perform on the united states medical licensing examination? the implications of large language models for medical education and knowledge assessment,” JMIR Medical Education, vol. 9, no. 1, p. e45312, 2023.
  • F. Antaki, S. Touma, D. Milad, J. El-Khoury, and R. Duval, “Evaluating the performance of chatgpt in ophthalmology: An analysis of its successes and shortcomings,” Ophthalmology Science, p. 100324, 2023.
  • [343]   Q. Lyu, J. Tan, M. E. Zapadka, J. Ponnatapura, C. Niu, K. J. Myers, G. Wang, and C. T. Whitlow, “Translating radiology reports into plain language using chatgpt and gpt-4 with prompt learning: results, limitations, and potential,” Visual Computing for Industry, Biomedicine, and Art, vol. 6, no. 1, p. 9, 2023.
  • [344]   F. Yu, L. Quartey, and F. Schilder, “Legal prompting: Teaching a language model to think like a lawyer,” arXiv preprint arXiv:2212.01326, 2022.
  • K. Jeblick, B. Schachtner, J. Dexl, A. Mittermeier, A. T. Stuber, J. Topalis, T. Weber, P. Wesp, B. Sabel, J. Ricke et al., “Chatgpt makes medicine easy to swallow: An exploratory case study on simplified radiology reports,” arXiv preprint arXiv:2212.14882, 2022.
  • X. Tang, A. Tran, J. Tan, and M. Gerstein, “Gersteinlab at mediqa-chat 2023: Clinical note summarization from doctor-patient conversations through fine-tuning and in-context learning,” arXiv preprint arXiv:2305.05001, 2023.
  • H.-T. Nguyen, “A brief report on lawgpt 1.0: A virtual legal assistant based on gpt-3,” arXiv preprint arXiv:2302.05729, 2023.
  • [346]   I. Chalkidis, “Chatgpt may pass the bar exam soon, but has a long way to go for the lexglue benchmark,” arXiv preprint arXiv:2304.12202, 2023.
  • J. H. Choi, K. E. Hickman, A. Monahan, and D. Schwarcz, “Chatgpt goes to law school,” Available at SSRN, 2023.
  • [348]   X. Cai, S. Liu, J. Han, L. Yang, Z. Liu, and T. Liu, “Chestxraybert: A pretrained language model for chest radiology report summarization,” IEEE Transactions on Multimedia, 2021.
  • H. Xiong, S. Wang, Y. Zhu, Z. Zhao, Y. Liu, Q. Wang, and D. Shen, “Doctorglm: Fine-tuning your chinese doctor is not a herculean task,” arXiv preprint arXiv:2304.01097, 2023.
  • [350]   A. B. Abacha, W.-w. Yim, G. Adams, N. Snider, and M. Yetisgen-Yildiz, “Overview of the mediqa-chat 2023 shared tasks on the summarization & generation of doctor-patient conversations,” in Proceedings of the 5th Clinical Natural Language Processing Workshop, 2023, pp. 503–513.
  • [351]   H. Su, J. Kasai, Y. Wang, Y. Hu, M. Ostendorf, W.-t. Yih, N. A. Smith, L. Zettlemoyer, T. Yu et al., “One embedder, any task: Instruction-finetuned text embeddings,” arXiv preprint arXiv:2212.09741, 2022.
  • Y. Lan, Y. Wu, W. Xu, W. Feng, and Y. Zhang, “Chinese fine-grained financial sentiment analysis with large language models,” arXiv preprint arXiv:2306.14096, 2023.
  • [353]   G. Fatouros, J. Soldatos, K. Kouroumali, G. Makridis, and D. Kyriazis, “Transforming sentiment analysis in the financial domain with chatgpt,” arXiv preprint arXiv:2308.07935, 2023.
  • [354]   M. Leippold, “Sentiment spin: Attacking financial sentiment with gpt-3,” Finance Research Letters, p. 103957, 2023.
  • [355]   P. Wiriyathammabhum, “Promptshots at the finnlp-2022 erai task: Pairwise comparison and unsupervised ranking,” in Proceedings of the Fourth Workshop on Financial Technology and Natural Language Processing (FinNLP), 2022, pp. 104–110.
  • [356]   A. Shah and S. Chava, “Zero is not hero yet: Benchmarking zero-shot performance of llms for financial tasks,” arXiv preprint arXiv:2305.16633, 2023.
  • L. Zhang, W. Cai, Z. Liu, Z. Yang, W. Dai, Y. Liao, Q. Qin, Y. Li, X. Liu, Z. Liu et al., “Fineval: A chinese financial domain knowledge evaluation benchmark for large language models,” arXiv preprint arXiv:2308.09975, 2023.
  • [358]   P. K. Rajpoot and A. Parikh, “Gpt-finre: In-context learning for financial relation extraction using large language models,” arXiv preprint arXiv:2306.17519, 2023.
  • [359]   L. Loukas, I. Stogiannidis, P. Malakasiotis, and S. Vassos, “Breaking the bank with chatgpt: Few-shot text classification for finance,” arXiv preprint arXiv:2308.14634, 2023.
  • [360]   I. Chalkidis, A. Jana, D. Hartung, M. Bommarito, I. Androutsopoulos, D. Katz, and N. Aletras, “Lexglue: A benchmark dataset for legal language understanding in english,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 4310–4330.
  • [361]   J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in Neural Information Processing Systems, vol. 35, pp. 24824–24837, 2022.
  • [362]   Z. Chen, W. Chen, C. Smiley, S. Shah, I. Borova, D. Langdon, R. Moussa, M. Beane, T.-H. Huang, B. R. Routledge et al., “Finqa: A dataset of numerical reasoning over financial data,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 3697–3711.
  • [363]   V. D. Lai, N. T. Ngo, A. P. B. Veyseh, H. Man, F. Dernoncourt, T. Bui, and T. H. Nguyen, “Chatgpt beyond english: Towards a comprehensive evaluation of large language models in multilingual learning,” arXiv preprint arXiv:2304.05613, 2023.
  • [364]   T. Fang, S. Yang, K. Lan, D. F. Wong, J. Hu, L. S. Chao, and Y. Zhang, “Is chatgpt a highly fluent grammatical error correction system? a comprehensive evaluation,” arXiv preprint arXiv:2304.01746, 2023.
  • J. Armengol-Estapé, O. de Gibert Bonet, and M. Melero, “On the multilingual capabilities of very large-scale english language models,” in Proceedings of the Thirteenth Language Resources and Evaluation Conference, 2022, pp. 3056–3068.
  • [366]   K. Ahuja, R. Hada, M. Ochieng, P. Jain, H. Diddee, S. Maina, T. Ganu, S. Segal, M. Axmed, K. Bali et al., “Mega: Multilingual evaluation of generative ai,” arXiv preprint arXiv:2303.12528, 2023.
  • X. Zhang, S. Li, B. Hauer, N. Shi, and G. Kondrak, “Don’t trust gpt when your question is not in english,” arXiv preprint arXiv:2305.16339, 2023.
  • M. Das, S. K. Pandey, and A. Mukherjee, “Evaluating chatgpt’s performance for multilingual and emoji-based hate speech detection,” arXiv preprint arXiv:2305.13276, 2023.
  • [369]   R. Hada, V. Gumma, A. de Wynter, H. Diddee, M. Ahmed, M. Choudhury, K. Bali, and S. Sitaram, “Are large language model-based evaluators the solution to scaling up multilingual evaluation?” arXiv preprint arXiv:2309.07462, 2023.
  • [370]   W. Q. Leong, J. G. Ngui, Y. Susanto, H. Rengarajan, K. Sarveswaran, and W. C. Tjhi, “Bhasa: A holistic southeast asian linguistic and cultural evaluation suite for large language models,” arXiv preprint arXiv:2309.06085, 2023.
  • [371]   R. Bommasani, P. Liang, and T. Lee, “Holistic evaluation of language models,” Annals of the New York Academy of Sciences, 2023.
  • A. Srivastava, A. Rastogi, A. Rao, A. A. M. Shoeb, A. Abid, A. Fisch, A. R. Brown, A. Santoro, A. Gupta, A. Garriga-Alonso et al., “Beyond the imitation game: Quantifying and extrapolating the capabilities of language models,” Transactions on Machine Learning Research, 2023.
  • [373]   F. Gilardi, M. Alizadeh, and M. Kubli, “Chatgpt outperforms crowd-workers for text-annotation tasks,” arXiv preprint arXiv:2303.15056, 2023.
  • X. He, Z. Lin, Y. Gong, A. Jin, H. Zhang, C. Lin, J. Jiao, S. M. Yiu, N. Duan, W. Chen et al., “Annollm: Making large language models to be better crowdsourced annotators,” arXiv preprint arXiv:2303.16854, 2023.
  • P. Törnberg, “Chatgpt-4 outperforms experts and crowd workers in annotating political twitter messages with zero-shot learning,” arXiv preprint arXiv:2304.06588, 2023.
  • [376]   Y. Zhu, P. Zhang, E.-U. Haq, P. Hui, and G. Tyson, “Can chatgpt reproduce human-generated labels? a study of social computing tasks,” arXiv preprint arXiv:2304.10145, 2023.
  • [377]   L. Li, L. Fan, S. Atreja, and L. Hemphill, “‘Hot’ chatgpt: The promise of chatgpt in detecting and discriminating hateful, offensive, and toxic comments on social media,” arXiv preprint arXiv:2304.10619, 2023.
  • Y. Gu, S. Zhang, N. Usuyama, Y. Woldesenbet, C. Wong, P. Sanapathi, M. Wei, N. Valluri, E. Strandberg, T. Naumann et al., “Distilling large language models for biomedical knowledge extraction: A case study on adverse drug events,” arXiv preprint arXiv:2307.06439, 2023.
  • S. Wang, Y. Liu, Y. Xu, C. Zhu, and M. Zeng, “Want to reduce labeling cost? gpt-3 can help,” in Findings of the Association for Computational Linguistics: EMNLP 2021, 2021, pp. 4195–4205.
  • [380]   B. Ding, C. Qin, L. Liu, L. Bing, S. Joty, and B. Li, “Is gpt-3 a good data annotator?” arXiv preprint arXiv:2212.10450, 2022.
  • [381]   S. Meoni, E. De la Clergerie, and T. Ryffel, “Large language models as instructors: A study on multilingual clinical entity extraction,” in The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks, 2023, pp. 178–190.
  • [382]   Y. Xu, R. Xu, D. Iter, Y. Liu, S. Wang, C. Zhu, and M. Zeng, “Inheritsumm: A general, versatile and compact summarizer by distilling from gpt,” arXiv preprint arXiv:2305.13083, 2023.
  • [383]   M. Alizadeh, M. Kubli, Z. Samei, S. Dehghani, J. D. Bermeo, M. Korobeynikova, and F. Gilardi, “Open-source large language models outperform crowd workers and approach chatgpt in text-annotation tasks,” arXiv preprint arXiv:2307.02179, 2023.
  • [384]   S. Thapa, U. Naseem, and M. Nasim, “From humans to machines: can chatgpt-like llms effectively replace human annotators in nlp tasks,” in Workshop Proceedings of the 17th International AAAI Conference on Web and Social Media, 2023.
  • [385]   J. S. Murthy, G. Siddesh, and K. Srinivasa, “Twitsenti: a real-time twitter sentiment analysis and visualization framework,” Journal of Information & Knowledge Management, vol. 18, no. 02, p. 1950013, 2019.
  • W. Van Atteveldt, M. A. Van der Velden, and M. Boukes, “The validity of sentiment analysis: Comparing manual annotation, crowd-coding, dictionary approaches, and machine learning algorithms,” Communication Methods and Measures, vol. 15, no. 2, pp. 121–140, 2021.
  • [387]   M. Chmielewski and S. C. Kucker, “An mturk crisis? shifts in data quality and the impact on study results,” Social Psychological and Personality Science, vol. 11, no. 4, pp. 464–473, 2020.
  • [388]   P. He, B. Peng, L. Lu, S. Wang, J. Mei, Y. Liu, R. Xu, H. H. Awadalla, Y. Shi, C. Zhu et al., “Z-code++: A pre-trained language model optimized for abstractive summarization,” arXiv preprint arXiv:2208.09770, 2022.
  • [389]   H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, E. Li, X. Wang, M. Dehghani, S. Brahma et al., “Scaling instruction-finetuned language models,” arXiv preprint arXiv:2210.11416, 2022.
  • J. Cegin, J. Simko, and P. Brusilovsky, “Chatgpt to replace crowdsourcing of paraphrases for intent classification: Higher diversity and comparable model robustness,” arXiv preprint arXiv:2305.12947, 2023.
  • S. Oh, W. Jung et al., “Data augmentation for neural machine translation using generative language model,” arXiv preprint arXiv:2307.16833, 2023.
  • S. Sharma, A. Joshi, N. Mukhija, Y. Zhao, H. Bhathena, P. Singh, S. Santhanam, and P. Biswas, “Systematic review of effect of data augmentation using paraphrasing on named entity recognition,” in NeurIPS 2022 Workshop on Synthetic Data for Empowering ML Research, 2022.
  • Z. Guo, P. Wang, Y. Wang, and S. Yu, “Dr. llama: Improving small language models in domain-specific qa via generative data augmentation,” arXiv preprint arXiv:2305.07804, 2023.
  • [394]   A. Abaskohi, S. Rothe, and Y. Yaghoobzadeh, “Lm-cppf: Paraphrasing-guided data augmentation for contrastive prompt-based few-shot fine-tuning,” arXiv preprint arXiv:2305.18169, 2023.
  • S. Sarker, L. Qian, and X. Dong, “Medical data augmentation via chatgpt: A case study on medication identification and medication event classification,” arXiv preprint arXiv:2306.07297, 2023.
  • H. Dai, Z. Liu, W. Liao, X. Huang, Y. Cao, Z. Wu, L. Zhao, S. Xu, W. Liu, N. Liu et al., “Auggpt: Leveraging chatgpt for text data augmentation,” arXiv preprint arXiv:2302.13007, 2023.
  • [397]   Y. Fang, X. Li, S. W. Thomas, and X. Zhu, “Chatgpt as data augmentation for compositional generalization: A case study in open intent detection,” arXiv preprint arXiv:2308.13517, 2023.
  • [398]   C. Shorten and T. M. Khoshgoftaar, “A survey on image data augmentation for deep learning,” Journal of big data, vol. 6, no. 1, pp. 1–48, 2019.
  • [399]   B. Li, Y. Hou, and W. Che, “Data augmentation approaches in natural language processing: A survey,” Ai Open, vol. 3, pp. 71–90, 2022.
  • P. Liu, X. Wang, C. Xiang, and W. Meng, “A survey of text data augmentation,” in 2020 International Conference on Computer Communication and Network Security (CCNS). IEEE, 2020, pp. 191–195.
  • S. Y. Feng, V. Gangal, J. Wei, S. Chandar, S. Vosoughi, T. Mitamura, and E. Hovy, “A survey of data augmentation approaches for nlp,” in Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 2021, pp. 968–988.
  • [402]   M. Bayer, M.-A. Kaufhold, and C. Reuter, “A survey on data augmentation for text classification,” ACM Computing Surveys, vol. 55, no. 7, pp. 1–39, 2022.
  • [403]   Y. Belinkov and Y. Bisk, “Synthetic and natural noise both break neural machine translation,” in International Conference on Learning Representations, 2018.
  • [404]   C. Coulombe, “Text data augmentation made simple by leveraging nlp cloud apis,” arXiv preprint arXiv:1812.04718, 2018.
  • [405]   J. Wei and K. Zou, “Eda: Easy data augmentation techniques for boosting performance on text classification tasks,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 6382–6388.
  • [406]   W. Y. Wang and D. Yang, “That’s so annoying!!!: A lexical and frame-semantic embedding based data augmentation approach to automatic categorization of annoying behaviors using #petpeeve tweets,” in Proceedings of the 2015 conference on empirical methods in natural language processing, 2015, pp. 2557–2563.
  • [407]   R. Sennrich, B. Haddow, and A. Birch, “Improving neural machine translation models with monolingual data,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, pp. 86–96.
  • [408]   C. Mallikarjuna and S. Sivanesan, “Question classification using limited labelled data,” Information Processing & Management, vol. 59, no. 6, p. 103094, 2022.
  • [409]   H. Zhan, Z. Li, Y. Wang, L. Luo, T. Feng, X. Kang, Y. Hua, L. Qu, L.-K. Soon, S. Sharma et al., “Socialdial: A benchmark for socially-aware dialogue systems,” arXiv preprint arXiv:2304.12026, 2023.
  • J. Wang, Z. Yao, A. Mitra, S. Osebe, Z. Yang, and H. Yu, “Umass bionlp at mediqa-chat 2023: Can llms generate high-quality synthetic note-oriented doctor-patient conversations?” arXiv preprint arXiv:2306.16931, 2023.
  • [411]    S. Gunasekar, Y. Zhang, J. Aneja, C. C. T. Mendes, A. D. Giorno, S. Gopi, M. Javaheripi, P. C. Kauffmann, G. de Rosa, O. Saarikivi, A. Salim, S. Shah, H. S. Behl, X. Wang, S. Bubeck, R. Eldan, A. T. Kalai, Y. T. Lee, and Y.-F. Li, “Textbooks are all you need,” ArXiv, vol. abs/2306.11644, 2023.
  • C. Whitehouse, M. Choudhury, and A. F. Aji, “Llm-powered data augmentation for enhanced crosslingual performance,” ArXiv, vol. abs/2305.14288, 2023.
  • T. Hartvigsen, S. Gabriel, H. Palangi, M. Sap, D. Ray, and E. Kamar, “Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 3309–3326.
  • [414]   T. Markov, C. Zhang, S. Agarwal, F. E. Nekoul, T. Lee, S. Adler, A. Jiang, and L. Weng, “A holistic approach to undesired content detection in the real world,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 12, 2023, pp. 15009–15018.
  • [415]   Z. Guo, P. Wang, Y. Wang, and S. Yu, “Dr. llama: Improving small language models on pubmedqa via generative data augmentation,” ArXiv, vol. abs/2305.07804, 2023.
  • R. Eldan and Y. Li, “Tinystories: How small can language models be and still speak coherent english?” arXiv preprint arXiv:2305.07759, 2023.
  • H. Liu, Z. Teng, L. Cui, C. Zhang, Q. Zhou, and Y. Zhang, “Logicot: Logical chain-of-thought instruction-tuning data collection with gpt-4,” arXiv preprint arXiv:2305.12147, 2023.
  • [418]   B. Peng, C. Li, P. He, M. Galley, and J. Gao, “Instruction tuning with gpt-4,” arXiv preprint arXiv:2304.03277, 2023.
  • [419]   I. Malkiel, U. Alon, Y. Yehuda, S. Keren, O. Barkan, R. Ronen, and N. Koenigstein, “Gpt-calls: Enhancing call segmentation and tagging by generating synthetic conversations via large language models,” arXiv preprint arXiv:2306.07941, 2023.
  • [420]   J. P. Wahle, T. Ruas, F. Kirstein, and B. Gipp, “How large language models are transforming machine-paraphrase plagiarism,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 952–963.
  • A. Michail, S. Konstantinou, and S. Clematide, “Uzh clyp at semeval-2023 task 9: Head-first fine-tuning and chatgpt data generation for cross-lingual learning in tweet intimacy prediction,” arXiv preprint arXiv:2303.01194, 2023.
  • [422]   R. Tang, X. Han, X. Jiang, and X. Hu, “Does synthetic data generation of llms help clinical text mining?” arXiv preprint arXiv:2303.04360, 2023.
  • Y. Yu, Y. Zhuang, J. Zhang, Y. Meng, A. Ratner, R. Krishna, J. Shen, and C. Zhang, “Large language model as attributed training data generator: A tale of diversity and bias,” arXiv preprint arXiv:2306.15895, 2023.
  • [424]   W. Yang and G. Nicolai, “Neural machine translation data generation and augmentation using chatgpt,” arXiv preprint arXiv:2307.05779, 2023.
  • Y. Zhao, C. Zhao, L. Nan, Z. Qi, W. Zhang, X. Tang, B. Mi, and D. Radev, “Robut: A systematic study of table qa robustness against human-annotated adversarial perturbations,” arXiv preprint arXiv:2306.14321, 2023.
  • [426]   W. Xu, D. Wang, L. Pan, Z. Song, M. Freitag, W. Y. Wang, and L. Li, “Instructscore: Towards explainable text generation evaluation with automatic feedback,” arXiv preprint arXiv:2305.14282, 2023.
  • A. Sugiyama and N. Yoshinaga, “Data augmentation using back-translation for context-aware neural machine translation,” in Proceedings of the fourth workshop on discourse in machine translation (DiscoMT 2019), 2019, pp. 35–44.
  • [428]   F. Mireshghallah, J. Mattern, S. Gao, R. Shokri, and T. Berg-Kirkpatrick, “Smaller language models are better black-box machine-generated text detectors,” ArXiv, vol. abs/2305.09859, 2023.
  • B. Guo, X. Zhang, Z. Wang, M. Jiang, J. Nie, Y. Ding, J. Yue, and Y. Wu, “How close is chatgpt to human experts? comparison corpus, evaluation, and detection,” ArXiv, vol. abs/2301.07597, 2023.
  • P. Hacker, A. Engel, and M. Mauer, “Regulating chatgpt and other large generative ai models,” in Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, 2023, pp. 1112–1123.
  • [431]   L. De Angelis, F. Baglivo, G. Arzilli, G. P. Privitera, P. Ferragina, A. E. Tozzi, and C. Rizzo, “Chatgpt and the rise of large language models: the new ai-driven infodemic threat in public health,” Frontiers in Public Health, vol. 11, p. 1166120, 2023.
  • [432]   S. Mitrović, D. Andreoletti, and O. Ayoub, “Chatgpt or human? detect and explain. explaining decisions of machine learning model for detecting short chatgpt-generated text,” ArXiv, vol. abs/2301.13852, 2023.
  • C. A. Gao, F. M. Howard, N. S. Markov, E. C. Dyer, S. Ramesh, Y. Luo, and A. T. Pearson, “Comparing scientific abstracts generated by chatgpt to real abstracts with detectors and blinded human reviewers,” NPJ Digital Medicine, vol. 6, no. 1, p. 75, 2023.
  • [434]   D. R. Cotton, P. A. Cotton, and J. R. Shipway, “Chatting and cheating: Ensuring academic integrity in the era of chatgpt,” Innovations in Education and Teaching International, pp. 1–12, 2023.
  • [435]   P. C. Theocharopoulos, P. Anagnostou, A. Tsoukala, S. V. Georgakopoulos, S. K. Tasoulis, and V. P. Plagianakos, “Detection of fake generated scientific abstracts,” arXiv preprint arXiv:2304.06148, 2023.
  • W. Zaitsu and M. Jin, “Distinguishing chatgpt (-3.5,-4)-generated and human-written papers through japanese stylometric analysis,” arXiv preprint arXiv:2304.05534, 2023.
  • [437]   P. Yu, J. Chen, X. Feng, and Z. Xia, “Cheat: A large-scale dataset for detecting chatgpt-written abstracts,” arXiv preprint arXiv:2304.12008, 2023.
  • X. Yang, W. Cheng, L. Petzold, W. Y. Wang, and H. Chen, “Dna-gpt: Divergent n-gram analysis for training-free detection of gpt-generated text,” arXiv preprint arXiv:2305.17359, 2023.
  • [439]   Y. Liu, Z. Zhang, W. Zhang, S. Yue, X. Zhao, X. Cheng, Y. Zhang, and H. Hu, “Argugpt: evaluating, understanding and identifying argumentative essays generated by gpt models,” arXiv preprint arXiv:2304.07666, 2023.
  • M. S. Orenstrakh, O. Karnalim, C. A. Suarez, and M. Liut, “Detecting llm-generated text in computing education: A comparative study for chatgpt cases,” arXiv preprint arXiv:2307.07411, 2023.
  • W. Liao, Z. Liu, H. Dai, S. Xu, Z. Wu, Y. Zhang, X. Huang, D. Zhu, H. Cai, T. Liu et al., “Differentiate chatgpt-generated and human-written medical texts,” arXiv preprint arXiv:2304.11567, 2023.
  • [442]   H. Zhan, X. He, Q. Xu, Y. Wu, and P. Stenetorp, “G3detector: General gpt-generated text detector,” arXiv preprint arXiv:2305.12680, 2023.
  • E. Clark, T. August, S. Serrano, N. Haduong, S. Gururangan, and N. A. Smith, “All that’s ‘human’ is not gold: Evaluating human evaluation of generated text,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, pp. 7282–7296.
  • [444]   A. Pegoraro, K. Kumari, H. Fereidooni, and A.-R. Sadeghi, “To chatgpt, or not to chatgpt: That is the question!” arXiv preprint arXiv:2304.01487, 2023.
  • Z. Shi, Y. Wang, F. Yin, X. Chen, K.-W. Chang, and C.-J. Hsieh, “Red teaming language model detectors with language models,” arXiv preprint arXiv:2305.19713, 2023.
  • [446]   M. Khalil and E. Er, “Will chatgpt get you caught? rethinking of plagiarism detection,” arXiv preprint arXiv:2302.04335, 2023.
  • [447]   X. He, X. Shen, Z. Chen, M. Backes, and Y. Zhang, “Mgtbench: Benchmarking machine-generated text detection,” arXiv preprint arXiv:2303.14822, 2023.
  • H. Wang, X. Luo, W. Wang, and X. Yan, “Bot or human? detecting chatgpt imposters with a single question,” ArXiv, vol. abs/2305.06424, 2023.
  • Y. Chen, H. Kang, V. Zhai, L. Li, R. Singh, and B. Ramakrishnan, “Gpt-sentinel: Distinguishing human and chatgpt generated content,” ArXiv, vol. abs/2305.07969, 2023.
  • X. Yu, Y. Qi, K. Chen, G. Chen, X. Yang, P. Zhu, W. Zhang, and N. H. Yu, “Gpt paternity test: Gpt generated text detection with gpt genetic inheritance,” ArXiv, vol. abs/2305.12519, 2023.
  • [451]   L. Yang, F. Jiang, and H. Li, “Is chatgpt involved in texts? measure the polish ratio to detect chatgpt-generated text,” ArXiv, vol. abs/2307.11380, 2023.
  • K. Krishna, Y. Song, M. Karpinska, J. Wieting, and M. Iyyer, “Paraphrasing evades detectors of ai-generated text, but retrieval is an effective defense,” arXiv preprint arXiv:2303.13408, 2023.
  • [453]   D. Ippolito, D. Duckworth, C. Callison-Burch, and D. Eck, “Automatic detection of generated text is easiest when humans are fooled,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 1808–1822.
  • [454]   S. M. Lundberg and S.-I. Lee, “A unified approach to interpreting model predictions,” Advances in neural information processing systems, vol. 30, 2017.
  • [455]   X. Chen, J. Ye, C. Zu, N. Xu, R. Zheng, M. Peng, J. Zhou, T. Gui, Q. Zhang, and X. Huang, “How robust is gpt-3.5 to predecessors? a comprehensive study on language understanding tasks,” arXiv preprint arXiv:2303.00293, 2023.
  • [456]   J. Wang, X. Hu, W. Hou, H. Chen, R. Zheng, Y. Wang, L. Yang, H. Huang, W. Ye, X. Geng et al., “On the robustness of chatgpt: An adversarial and out-of-distribution perspective,” arXiv preprint arXiv:2302.12095, 2023.
  • [457]   T. Y. Zhuo, Z. Li, Y. Huang, Y.-F. Li, W. Wang, G. Haffari, and F. Shiri, “On robustness of prompt-based semantic parsing with large pre-trained language model: An empirical study on codex,” arXiv preprint arXiv:2301.12868, 2023.
  • [458]   K. Zhu, J. Wang, J. Zhou, Z. Wang, H. Chen, Y. Wang, L. Yang, W. Ye, N. Z. Gong, Y. Zhang et al., “Promptbench: Towards evaluating the robustness of large language models on adversarial prompts,” arXiv preprint arXiv:2306.04528, 2023.
  • [459]   A. Shirafuji, Y. Watanobe, T. Ito, M. Morishita, Y. Nakamura, Y. Oda, and J. Suzuki, “Exploring the robustness of large language models for solving programming problems,” arXiv preprint arXiv:2306.14583, 2023.
  • [460]   R. Han, T. Peng, C. Yang, B. Wang, L. Liu, and X. Wan, “Is information extraction solved by chatgpt? an analysis of performance, evaluation criteria, robustness and errors,” arXiv preprint arXiv:2305.14450, 2023.
  • H. Liu, R. Ning, Z. Teng, J. Liu, Q. Zhou, and Y. Zhang, “Evaluating the logical reasoning ability of chatgpt and gpt-4,” arXiv preprint arXiv:2304.03439, 2023.
  • [462]   A. Liu, X. Hu, L. Wen, and P. S. Yu, “A comprehensive evaluation of chatgpt’s zero-shot text-to-sql capability,” arXiv preprint arXiv:2303.13547, 2023.
  • E. Mitchell, Y. Lee, A. Khazatsky, C. D. Manning, and C. Finn, “Detectgpt: Zero-shot machine-generated text detection using probability curvature,” arXiv preprint arXiv:2301.11305, 2023.
  • [464]   S. Goyal, S. Doddapaneni, M. M. Khapra, and B. Ravindran, “A survey of adversarial defences and robustness in nlp,” ACM Computing Surveys, 2022.
  • [465]   S. Qiu, Q. Liu, S. Zhou, and W. Huang, “Adversarial attack and defense technologies in natural language processing: A survey,” Neurocomputing, vol. 492, pp. 278–307, 2022.
  • [466]   Z. Shen, J. Liu, Y. He, X. Zhang, R. Xu, H. Yu, and P. Cui, “Towards out-of-distribution generalization: A survey,” arXiv preprint arXiv:2108.13624, 2021.
  • [467]   X. Wang, Q. Liu, T. Gui, Q. Zhang, Y. Zou, X. Zhou, J. Ye, Y. Zhang, R. Zheng, Z. Pang et al., “Textflint: Unified multilingual robustness evaluation toolkit for natural language processing,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations, 2021, pp. 347–355.
  • Y. Chen, R. Wang, H. Jiang, S. Shi, and R. Xu, “Exploring the use of large language models for reference-free text quality evaluation: A preliminary empirical study,” arXiv preprint arXiv:2304.00723, 2023.
  • A. B. Sai, A. K. Mohankumar, and M. M. Khapra, “A survey of evaluation metrics used for nlg systems,” ACM Computing Surveys (CSUR), vol. 55, no. 2, pp. 1–39, 2022.
  • [470]   T. Y. Zhuo, “Large language models are state-of-the-art evaluators of code generation,” arXiv preprint arXiv:2304.14317, 2023.
  • [471]   H. Lai, A. Toral, and M. Nissim, “Multidimensional evaluation for text style transfer using chatgpt,” arXiv preprint arXiv:2304.13462, 2023.
  • Y. Liu, D. Iter, Y. Xu, S. Wang, R. Xu, and C. Zhu, “Gpteval: Nlg evaluation using gpt-4 with better human alignment,” arXiv preprint arXiv:2303.16634, 2023.
  • [473]   T. Kocmi and C. Federmann, “Large language models are state-of-the-art evaluators of translation quality,” arXiv preprint arXiv:2302.14520, 2023.
  • Q. Lu, B. Qiu, L. Ding, L. Xie, and D. Tao, “Error analysis prompting enables human-like translation evaluation in large language models: A case study on chatgpt,” arXiv preprint arXiv:2303.13809, 2023.
  • [475]   Z. Luo, Q. Xie, and S. Ananiadou, “Chatgpt as a factual inconsistency evaluator for text summarization,” 2023.
  • [476]   C. Shen, L. Cheng, Y. You, and L. Bing, “Are large language models good evaluators for abstractive summarization?” arXiv preprint arXiv:2305.13091, 2023.
  • [477]   J. Fu, S.-K. Ng, Z. Jiang, and P. Liu, “Gptscore: Evaluate as you desire,” arXiv preprint arXiv:2302.04166, 2023.
  • [478]   Y. Liu, A. R. Fabbri, P. Liu, D. Radev, and A. Cohan, “On learning to summarize with large language models as references,” arXiv preprint arXiv:2305.14239, 2023.
  • [479]   M. Gao, J. Ruan, R. Sun, X. Yin, S. Yang, and X. Wan, “Human-like summarization evaluation with chatgpt,” arXiv preprint arXiv:2304.02554, 2023.
  • T. Tang, H. Lu, Y. E. Jiang, H. Huang, D. Zhang, W. X. Zhao, and F. Wei, “Not all metrics are guilty: Improving nlg evaluation with llm paraphrasing,” arXiv preprint arXiv:2305.15067, 2023.
  • [481]   P. Wang, L. Li, L. Chen, D. Zhu, B. Lin, Y. Cao, Q. Liu, T. Liu, and Z. Sui, “Large language models are not fair evaluators,” arXiv preprint arXiv:2305.17926, 2023.
  • [482]   S. Jain, V. Keshava, S. M. Sathyendra, P. Fernandes, P. Liu, G. Neubig, and C. Zhou, “Multi-dimensional evaluation of text summarization with in-context learning,” arXiv preprint arXiv:2306.01200, 2023.
  • J. Wang, Y. Liang, F. Meng, H. Shi, Z. Li, J. Xu, J. Qu, and J. Zhou, “Is chatgpt a good nlg evaluator? a preliminary study,” arXiv preprint arXiv:2303.04048, 2023.
  • [484]   Y. Bai, J. Ying, Y. Cao, X. Lv, Y. He, X. Wang, J. Yu, K. Zeng, Y. Xiao, H. Lyu et al., “Benchmarking foundation models with language-model-as-an-examiner,” arXiv preprint arXiv:2306.04181, 2023.
  • W. Yang, C. Li, J. Zhang, and C. Zong, “Bigtrans: Augmenting large language models with multilingual translation capability over 100 languages,” arXiv preprint arXiv:2305.18098, 2023.
  • [486]   L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing et al., “Judging llm-as-a-judge with mt-bench and chatbot arena,” arXiv preprint arXiv:2306.05685, 2023.
  • K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in Proceedings of the 40th annual meeting of the Association for Computational Linguistics, 2002, pp. 311–318.
  • [488]   C.-Y. Lin, “Rouge: A package for automatic evaluation of summaries,” in Text summarization branches out, 2004, pp. 74–81.
  • [489]   S. Banerjee and A. Lavie, “Meteor: An automatic metric for mt evaluation with improved correlation with human judgments,” in Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, 2005, pp. 65–72.
  • T. Kocmi, C. Federmann, R. Grundkiewicz, M. Junczys-Dowmunt, H. Matsushita, and A. Menezes, “To ship or not to ship: An extensive evaluation of automatic metrics for machine translation,” in Proceedings of the Sixth Conference on Machine Translation, 2021, pp. 478–494.
  • [491]   T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, “Bertscore: Evaluating text generation with bert,” in International Conference on Learning Representations, 2019.
  • [492]   W. Zhao, M. Peyrard, F. Liu, Y. Gao, C. M. Meyer, and S. Eger, “Moverscore: Text generation evaluating with contextualized embeddings and earth mover distance,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 563–578.
  • [493]   W. Yuan, G. Neubig, and P. Liu, “Bartscore: Evaluating generated text as text generation,” Advances in Neural Information Processing Systems, vol. 34, pp. 27263–27277, 2021.
  • [494]   S. Zhou, U. Alon, S. Agarwal, and G. Neubig, “Codebertscore: Evaluating code generation with pretrained models of code,” arXiv preprint arXiv:2302.05527, 2023.
  • [495]   J. He, W. Kryściński, B. McCann, N. Rajani, and C. Xiong, “Ctrlsum: Towards generic controllable text summarization,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 5879–5915.
  • [496]   C. Shen, L. Cheng, L. Bing, Y. You, and L. Si, “Sentbs: Sentence-level beam search for controllable summarization,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 10256–10265.
  • [497]   Y. Liu, P. Liu, D. Radev, and G. Neubig, “Brio: Bringing order to abstractive summarization,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 2890–2903.
  • Q. Lu, L. Ding, L. Xie, K. Zhang, D. F. Wong, and D. Tao, “Toward human-like evaluation for natural language generation with error analysis,” arXiv preprint arXiv:2212.10179, 2022.
  • R. Bhardwaj and S. Poria, “Red-teaming large language models using chain of utterances for safety-alignment,” arXiv preprint arXiv:2308.09662, 2023.
  • D. Ganguli, L. Lovitt, J. Kernion, A. Askell, Y. Bai, S. Kadavath, B. Mann, E. Perez, N. Schiefer, K. Ndousse et al., “Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned,” arXiv preprint arXiv:2209.07858, 2022.
  • N. Mehrabi, P. Goyal, C. Dupuy, Q. Hu, S. Ghosh, R. Zemel, K.-W. Chang, A. Galstyan, and R. Gupta, “Flirt: Feedback loop in-context red teaming,” arXiv preprint arXiv:2308.04265, 2023.
  • E. Perez, S. Huang, F. Song, T. Cai, R. Ring, J. Aslanides, A. Glaese, N. McAleese, and G. Irving, “Red teaming language models with language models,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 3419–3448.
  • L. Chen, M. Zaharia, and J. Zou, “Frugalgpt: How to use large language models while reducing cost and improving performance,” arXiv preprint arXiv:2305.05176, 2023.
  • Z. Cheng, J. Kasai, and T. Yu, “Batch prompting: Efficient inference with large language model apis,” arXiv preprint arXiv:2301.08721, 2023.
  • Y. Li, “Unlocking context constraints of llms: Enhancing context efficiency of llms with self-information-based content filtering,” arXiv preprint arXiv:2304.12102, 2023.
  • M. A. Arefeen, B. Debnath, and S. Chakradhar, “Leancontext: Cost-efficient domain-specific question answering using llms,” arXiv preprint arXiv:2309.00841, 2023.
  • S. Golchin and M. Surdeanu, “Time travel in llms: Tracing data contamination in large language models,” arXiv preprint arXiv:2308.08493, 2023.
  • R. Aiyappa, J. An, H. Kwak, and Y.-Y. Ahn, “Can we trust the evaluation on chatgpt?” arXiv preprint arXiv:2303.12767, 2023.
  • A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, “Glue: A multi-task benchmark and analysis platform for natural language understanding,” in International Conference on Learning Representations, 2018.
  • X. Zhang, J. Zhao, and Y. LeCun, “Character-level convolutional networks for text classification,” Advances in neural information processing systems, vol. 28, 2015.
  • S. Narayan, S. B. Cohen, and M. Lapata, “Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 1797–1807.
  • Y. Zhang, Y. Li, L. Cui, D. Cai, L. Liu, T. Fu, X. Huang, E. Zhao, Y. Zhang, Y. Chen et al., “Siren’s song in the ai ocean: A survey on hallucination in large language models,” arXiv preprint arXiv:2309.01219, 2023.
  • V. Rawte, A. Sheth, and A. Das, “A survey of hallucination in large foundation models,” arXiv preprint arXiv:2309.05922, 2023.
  • S. Dhuliawala, M. Komeili, J. Xu, R. Raileanu, X. Li, A. Celikyilmaz, and J. Weston, “Chain-of-verification reduces hallucination in large language models,” 2023.
  • L. K. Umapathi, A. Pal, and M. Sankarasubbu, “Med-halt: Medical domain hallucination test for large language models,” arXiv preprint arXiv:2307.15343, 2023.
  • J. Li, X. Cheng, W. X. Zhao, J.-Y. Nie, and J.-R. Wen, “Halueval: A large-scale hallucination evaluation benchmark for large language models,” arXiv e-prints, pp. arXiv–2305, 2023.
  • B. Peng, M. Galley, P. He, H. Cheng, Y. Xie, Y. Hu, Q. Huang, L. Liden, Z. Yu, W. Chen et al., “Check your facts and try again: Improving large language models with external knowledge and automated feedback,” arXiv preprint arXiv:2302.12813, 2023.