
Large Language Models for Information Retrieval: A Survey

Aug 15, 2023

Yutao Zhu, Huaying Yuan, Shuting Wang, Jiongnan Liu, Wenhan Liu, Chenlong Deng, Zhicheng Dou, and Ji-Rong Wen

Abstract—As a primary means of information acquisition, information retrieval (IR) systems, such as search engines, have integrated themselves into our daily lives. These systems also serve as components of dialogue, question-answering, and recommender systems. The trajectory of IR has evolved dynamically from its origins in term-based methods to its integration with advanced neural models. While the neural models excel at capturing complex contextual signals and semantic nuances, thereby reshaping the IR landscape, they still face challenges such as data scarcity, interpretability, and the generation of contextually plausible yet potentially inaccurate responses. This evolution requires a combination of both traditional methods (such as term-based sparse retrieval methods with rapid response) and modern neural architectures (such as language models with powerful language understanding capacity). Meanwhile, the emergence of large language models (LLMs), typified by ChatGPT and GPT-4, has revolutionized natural language processing due to their remarkable language understanding, generation, generalization, and reasoning abilities. Consequently, recent research has sought to leverage LLMs to improve IR systems. Given the rapid evolution of this research trajectory, it is necessary to consolidate existing methodologies and provide nuanced insights through a comprehensive overview. In this survey, we delve into the confluence of LLMs and IR systems, including crucial aspects such as query rewriters, retrievers, rerankers, and readers. Additionally, we explore promising directions within this expanding field.

Index Terms—Large Language Models; Information Retrieval; Query Rewrite; Rerank; Reader; Fine-tuning; Prompting

1 INTRODUCTION

Information access is one of the fundamental daily needs of human beings. To fulfill the need for rapid acquisition of desired information, various information retrieval (IR) systems have been developed [1–4]. Prominent examples include search engines such as Google, Bing, and Baidu, which serve as IR systems on the Internet, adept at retrieving relevant web pages in response to user queries and providing convenient and efficient access to information on the Internet. It is worth noting that IR extends beyond web page retrieval. In dialogue systems (chatbots) [1, 5–8], such as Microsoft Xiaoice [2], Apple Siri, and Google Assistant, IR systems play a crucial role in retrieving appropriate responses to user input utterances, thereby producing natural and fluent human-machine conversations. Similarly, in question-answering systems [3, 9], IR systems are employed to select relevant clues essential for addressing user questions effectively. In image search engines [4], IR systems excel at returning images that align with user input queries. Given the exponential growth of information, research and industry have become increasingly interested in the development of effective IR systems.

The core function of an IR system is retrieval, which aims to determine the relevance between user-issued queries and the content to be retrieved, including various types of information such as texts, images, music, and more. For the scope of this survey, we concentrate solely on reviewing text retrieval systems, in which query-document relevance is commonly measured by a matching score. Given that IR systems operate on extensive repositories, the efficiency of retrieval algorithms is of paramount importance. To improve the user experience, retrieval performance is enhanced from both the upstream (query reformulation) and downstream (reranking and reading) perspectives. As an upstream technique, query reformulation is designed to refine user queries so that they are more effective at retrieving relevant documents [10, 11]. With the recent surge in the popularity of conversational search, this technique has received increasing attention. On the downstream side, reranking approaches are developed to further adjust the document ranking [12–14]. In contrast to the retrieval stage, reranking is performed only on a limited set of relevant documents already retrieved by the retriever. Under this circumstance, the emphasis is placed on achieving higher effectiveness rather than higher efficiency, allowing for the application of more complex approaches in the reranking process. Additionally, reranking can accommodate other specific requirements, such as personalization [15–18] and diversification [19–22]. Following the retrieval and reranking stages, a reading component is incorporated to summarize the retrieved documents and deliver a concise document to users [23, 24]. While traditional IR systems typically require users to gather and organize relevant information themselves, the reading component is an integral

Fig. 1. Overview of existing studies that apply LLMs to information retrieval. LLMs can be used as query rewriters, retrievers, rerankers, and readers.

part of new IR systems such as New Bing, streamlining users’ browsing experience and saving valuable time.

The trajectory of information retrieval (IR) has traversed a dynamic evolution, transitioning from its origins in term-based methods to the integration of neural models. Initially, IR was anchored in term-based methods [25] and Boolean logic, focusing on keyword matching for document retrieval. The paradigm gradually shifted with the introduction of vector space models [26], unlocking the potential to capture nuanced semantic relationships between terms. This progression continued with statistical language models [27, 28], refining relevance estimation through contextual and probabilistic considerations. The influential BM25 algorithm [29] played an important role during this phase, revolutionizing relevance ranking by accounting for term frequency and document length variations. The most recent chapter in IR’s journey is marked by the ascendancy of neural models [3, 30–32]. These models excel at capturing intricate contextual cues and semantic nuances, reshaping the landscape of IR. However, these neural models still face challenges such as data scarcity, interpretability, and the potential generation of plausible yet inaccurate responses. Thus, the evolution of IR continues to be a journey of balancing traditional strengths (such as the BM25 algorithm’s high efficiency) with the remarkable capability (such as semantic understanding) brought about by modern neural architectures.

Large language models (LLMs) have recently emerged as transformative forces across various research fields, such as natural language processing (NLP) [33–35], recommender systems [36–39], finance [40], and even molecule discovery [41]. These cutting-edge LLMs are primarily based on the Transformer architecture and undergo extensive pre-training on diverse textual sources, including web pages, research articles, books, and code. As their scale continues to expand (in both model size and data volume), LLMs have demonstrated remarkable advances in their capabilities. On the one hand, LLMs have exhibited unprecedented proficiency in language understanding and generation, resulting in responses that are more human-like and align better with human intentions. On the other hand, larger LLMs have shown impressive emergent abilities when dealing with complex tasks [42], such as generalization and reasoning skills. Notably, LLMs can effectively apply their learned knowledge and reasoning abilities to tackle new tasks with just a few task-specific demonstrations or appropriate instructions [43, 44]. Furthermore, advanced techniques, such as in-context learning, have significantly enhanced the generalization performance of LLMs without requiring fine-tuning on specific downstream tasks [34]. This breakthrough is particularly valuable, as it reduces the need for extensive fine-tuning while attaining remarkable task performance. Powered by prompting strategies such as chain-of-thought, LLMs can generate outputs with step-by-step reasoning, navigating complex decision-making processes [45]. Leveraging the impressive power of LLMs can undoubtedly improve the performance of IR systems. By incorporating these sophisticated language models, IR systems can provide users with more accurate responses, ultimately reshaping the landscape of information access and retrieval.

Initial efforts have been made to utilize the potential of LLMs in the development of novel IR systems. Notably, in terms of practical applications, New Bing is designed to improve the users’ experience of using search engines by extracting information from disparate web pages and condensing it into concise summaries that serve as responses to user-generated queries. In the research community, LLMs have proven useful within specific modules of IR systems (such as retrievers), thereby enhancing the overall performance of these systems. Due to the rapid evolution of LLM-enhanced IR systems, it is essential to comprehensively review their most recent advancements and challenges.

Our survey provides an insightful exploration of the intersection between LLMs and IR systems, covering key perspectives such as query rewriters, retrievers, rerankers, and readers (as shown in Figure 1). This analysis enhances our understanding of LLMs’ potential and limitations in advancing the IR field. For this survey, we create a GitHub repository by collecting the relevant papers and resources about LLM4IR. We will continue to update the repository with newer papers. This survey will also be periodically updated according to the development of this area. We notice that there are several surveys for PLMs, LLMs, and their applications (e.g., AIGC or recommender systems) [46–52]. Among these, we highly recommend the survey of LLMs [52], which provides a systematic and comprehensive reference to many important aspects of LLMs. Compared with them, we focus on the techniques and methods for developing and applying LLMs for IR systems. In addition, we notice a perspective paper discussing the opportunities for IR when meeting LLMs [53]. It would be an excellent supplement to this survey regarding future directions.

The remainder of this survey is organized as follows: Section 2 introduces the background for IR and LLMs. Sections 3, 4, 5, and 6 respectively review recent progress from the perspectives of the query rewriter, retriever, reranker, and reader, which are four key components of an IR system. Then, Section 7 discusses potential directions for future research. Finally, we conclude the survey in Section 8 by summarizing the major findings.

2 BACKGROUND

2.1 Information Retrieval

Information retrieval (IR), as an essential branch of computer science, aims to efficiently retrieve information relevant to user queries from a large repository. Generally, users interact with the system by submitting their queries in textual form. Subsequently, IR systems undertake the task of matching and ranking these user-supplied queries against an indexed database, thereby facilitating the retrieval of the most pertinent results.

The field of IR has witnessed significant advancement with the emergence of various models over time. One such early model is the Boolean model, which employs Boolean logic operators to combine query terms and retrieve documents that satisfy specific conditions [25]. Based on the “bag-of-words” assumption, the vector space model [26] represents documents and queries as vectors in term-based space. Relevance estimation is then performed by assessing the lexical similarity between the query and document vectors. The efficiency of this model is further improved through the effective organization of text content using the inverted index. Moving towards more sophisticated approaches, statistical language models have been introduced to estimate the likelihood of term occurrences and incorporate context information, leading to more accurate and context-aware retrieval [27, 54]. In recent years, the neural IR [30, 55, 56] paradigm has gained considerable attention in the research community. By harnessing the powerful representation capabilities of neural networks, this paradigm can capture semantic relationships between queries and documents, thereby significantly enhancing retrieval performance.

Researchers have identified several challenges with implications for the performance and effectiveness of IR systems, such as query ambiguity and retrieval efficiency. In light of these challenges, researchers have directed their attention towards crucial modules within the retrieval process, aiming to address specific issues and effectuate corresponding enhancements. The pivotal role of these modules in ameliorating the IR pipeline and elevating system performance cannot be overstated. In this survey, we focus on the following four modules, which have been greatly enhanced by LLMs.

Query Rewriter is an essential IR module that seeks to improve both the precision and expressiveness of user queries. Positioned at the early stage of the IR pipeline, this module assumes the crucial role of refining or modifying the initial query to align more accurately with the user’s information requirements. As an integral part of query rewriting, query expansion techniques, with pseudo relevance feedback being a prominent example, represent the mainstream approach to achieving query expression refinement. In addition to its utility in improving search effectiveness across general scenarios, the query rewriter finds application in diverse specialized retrieval contexts, such as personalized search and conversational search, thus further demonstrating its significance.

Retriever, as discussed here, is typically employed in the early stages of IR for document recall. The evolution of retrieval technologies reflects a constant pursuit of more effective and efficient methods to address the challenges posed by ever-growing text collections. In numerous experiments on IR systems over the years, the classical “bag-of-words” model BM25 [29] has demonstrated its robust performance and high efficiency. In the wake of the neural IR paradigm’s ascendancy, prevalent approaches have primarily revolved around projecting queries and documents into high-dimensional vector spaces, and subsequently computing their relevance scores through inner product calculations. This paradigmatic shift enables a more efficient understanding of query-document relationships, leveraging the power of vector representations to capture semantic similarities.

Reranker, as another crucial module in the retrieval pipeline, primarily focuses on fine-grained reordering of documents within the retrieved document set. Different from the retriever, which emphasizes the balance of efficiency and effectiveness, the reranker module places a greater emphasis on the quality of document ranking. In pursuit of enhancing search result quality, researchers delve into more complex matching methods than the traditional vector inner product, thereby furnishing richer matching signals to the reranker. Moreover, the reranker facilitates the adoption of specialized ranking strategies tailored to meet distinct user requirements, such as personalized and diversified search results. By integrating domain-specific objectives, the reranker module can deliver tailored and purposeful search results, enhancing the overall user experience.

Reader has evolved as a crucial module with the rapid development of large language model technologies. Its ability to comprehend real-time user intent and generate dynamic responses based on the retrieved text has revolutionized the presentation of IR results. In comparison to presenting a list of candidate documents, the reader module organizes answer texts in a more intuitive manner, simulating the natural way humans access information. To enhance the credibility of generated responses, the integration of references into generated responses has been an effective technique of the reader module.

2.2 Large Language Models

Language models (LMs) are designed to calculate the generative likelihood of word sequences by taking into account the contextual information from preceding words, thereby predicting the probability of subsequent words. Consequently, by employing certain word selection strategies (such as greedy decoding or random sampling), LMs can proficiently generate natural language texts. Although the primary objective of LMs lies in text generation, recent studies [57] have revealed that a wide array of natural language processing problems can be effectively reformulated into a text-to-text format, thus rendering them amenable to resolution through text generation. This has led to LMs becoming the de facto solution for the majority of text-related problems.

The evolution of LMs can be categorized into four primary stages, as discussed in prior literature [52]. Initially, LMs were rooted in statistical learning techniques and were termed statistical language models. These models tackled the issue of word prediction by employing the Markov assumption to predict the subsequent word based on preceding words. Thereafter, neural networks, particularly recurrent neural networks (RNNs), were introduced to calculate the likelihood of text sequences and establish neural language models. These advancements made it feasible to utilize LMs for representation learning beyond mere word sequence modeling. ELMo [58] first proposed to learn contextualized word representations through pre-training a bidirectional LSTM (biLSTM) network on large-scale corpora, followed by fine-tuning on specific downstream tasks. Similarly, BERT [59] proposed to pre-train a Transformer [60] encoder with a specially designed Masked Language Modeling (MLM) task and Next Sentence Prediction (NSP) task on large corpora. These studies initiated a new era of pre-trained language models (PLMs), with the “pre-training then fine-tuning” paradigm emerging as the prevailing learning approach. Along this line, numerous generative PLMs (e.g., GPT-2 [33], BART [61], and T5 [57]) have been developed for text generation problems, including summarization, machine translation, and dialogue generation. Recently, researchers have observed that increasing the scale of PLMs (e.g., model size or data amount) can consistently improve their performance on downstream tasks (a phenomenon commonly referred to as the scaling law [62, 63]). Moreover, large-sized PLMs exhibit promising abilities (termed emergent abilities [42]) in addressing complex tasks, which are not evident in their smaller counterparts. Therefore, the research community refers to these large-sized PLMs as large language models (LLMs).

As shown in Figure 2, existing LLMs can be categorized into two groups based on their architectures: encoder-decoder [57, 61, 64–69] and decoder-only [33–35, 70–80] models. The encoder-decoder models incorporate an encoder component to transform the input text into vectors, which are then employed for producing output texts. For example, T5 [57] is an encoder-decoder model that converts each natural language processing problem into a text-to-text form and resolves it as a text generation problem. In contrast, decoder-only models, typified by GPT, rely on the Transformer decoder architecture. These models use a self-attention

Fig. 2. The evolution of LLMs (encoder-decoder and decoder-only structures).

mechanism with a diagonal attention mask to generate a sequence of words from left to right. Building upon the success of GPT-3 [34], which was the first model to exceed 100B parameters, several noteworthy models have been developed, including GPT-J, BLOOM [78], OPT [75], Chinchilla [81], and LLaMA [35]. These models follow a Transformer decoder structure similar to that of GPT-3 and are trained on various combinations of datasets.

Owing to their vast number of parameters, fine-tuning LLMs for specific tasks, such as IR, is often deemed impractical. Consequently, two prevailing methods for applying LLMs have been established: in-context learning (ICL) and parameter-efficient fine-tuning. ICL is one of the emergent abilities of LLMs [34], empowering them to comprehend and furnish answers based on the provided input context, rather than relying merely on their pre-training knowledge. This method requires only the formulation of the task description and demonstrations in natural language, which are then fed as input to the LLM. Notably, parameter tuning is not required for ICL. Additionally, the efficacy of ICL can be further augmented through the adoption of chain-of-thought prompting, which uses demonstrations that spell out intermediate reasoning steps to guide the model’s reasoning process. ICL is the most commonly used method for applying LLMs to IR. Parameter-efficient fine-tuning [82–84] aims to reduce the number of trainable parameters while maintaining satisfactory performance. LoRA [82], for example, has been widely applied to open-source LLMs (e.g., LLaMA and BLOOM) for this purpose. Recently, QLoRA [85] has been proposed to further reduce memory usage by leveraging a frozen 4-bit quantized LLM for gradient computation. Despite the exploration of parameter-efficient fine-tuning for various NLP tasks, its implementation in IR tasks remains relatively limited, representing a potential avenue for future research.
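To make the parameter-efficient route concrete, the following is a minimal LoRA fine-tuning sketch built on the Hugging Face PEFT library; the base checkpoint, target modules, and hyperparameters are illustrative assumptions rather than settings prescribed by the works surveyed here.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load a decoder-only base model (placeholder checkpoint; any causal LM works).
base = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")

# Inject low-rank adapters into the attention projections while keeping the
# original weights frozen, so only a small fraction of parameters is trained.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the full model

The resulting model can then be trained with a standard fine-tuning loop on task-specific data, such as query rewriting or ranking pairs.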

Fig. 3. An example of LLM-based query rewriting for ad-hoc search. The example is cited from the Query2Doc paper [86]. LLMs are used to generate a passage to supplement the original query, where N = 0 and N > 0 correspond to zero-shot and few-shot scenarios.

3 QUERY REWRITER

In traditional IR systems, the user inputs a query, and the system returns a list of documents matching the query terms. However, original queries are often short or ambiguous, making them more susceptible to the problem of vocabulary mismatch. For example, if a user inputs the query “automobile” into a search engine, they expect to find information about cars. Nevertheless, the majority of relevant documents in the search results use the term “vehicle” rather than “automobile”. Since the search engine primarily relies on the exact query terms, it may not retrieve the most relevant documents containing the term “vehicle”, leading to a vocabulary mismatch challenge. Furthermore, in modern forms of IR systems, such as conversational search, the task of query rewriting plays an even more crucial role.

Traditional query rewriting methods have been developed to tackle this challenge by iteratively refining the user’s original query based on an analysis of the top-ranked documents retrieved by the initial query. These methods have demonstrated their effectiveness in improving retrieval accuracy by iteratively refining the query representation and incorporating additional information from the document collection. Notable examples of such methods include RM3 [87], LCE [88], KL expansion [89], and relevance modeling [90]. However, these methods often rely on predefined rules or heuristics, limiting their ability to fully capture the subtle nuances of user intent.

Fortunately, recent advancements in LLMs present promising opportunities to augment query rewriting capabilities and overcome the limitations of traditional approaches. These powerful language models have the potential to leverage vast amounts of textual data and learn more nuanced representations of queries and documents. By leveraging the capabilities of these models, researchers can explore novel approaches to query rewriting that can better align with the complex and diverse information needs of

Fig. 4. An example of LLM-based query rewriting for conversational search. The example is cited from LLMCS [91]. The LLM is used to generate a query based on the demonstrations and the previous search context. Additional responses are generated to improve query understanding. N = 0 and N > 0 correspond to zero-shot and few-shot scenarios.

users. In the following sections, we will introduce the details of employing LLMs in query rewriting.

3.1 Rewriting Target

Query rewriting typically serves two scenarios: ad-hoc retrieval, which addresses vocabulary mismatches between queries and documents, and conversational search, which refines and adapts system responses based on evolving conversations. In the following sections, we will elaborate on the motivations behind these pursuits.

3.1.1 Ad-hoc Retrieval

Query rewriting plays a crucial role in ad-hoc retrieval as it serves as the initial step in the search funnel, and the overall effectiveness of the search process highly relies on the quality of query rewriting results. Within the context of ad-hoc retrieval, the primary objective of query rewriting is to retrieve a collection of documents that align more closely with the user’s information needs. The utilization of large language models (LLMs) in query rewriting offers several notable advantages, which can be summarized as follows:

• Better Semantic Understanding. LLMs have a deep understanding of language semantics, allowing them to capture the meaning and context of queries more effectively. As a comparison, traditional query rewriting methods primarily rely on statistical analysis of term frequencies [87–89], which may not capture the real intent of the query.

• Broad Knowledge. LLMs possess extensive knowledge, allowing them to draw from a wide range of concepts, facts, and information. This knowledge enables them to generate relevant and contextually appropriate query rewrites by leveraging their understanding of various topics.

• Independence from First-pass Retrieval. Traditional pseudo relevance feedback (PRF) methods retrieve a set of pseudo-relevant documents as a source to refine the original query. However, the presence of irrelevant results in the pseudo relevance feedback set can introduce noise and potentially damage retrieval performance. In contrast, LLMs can generate query rewrites directly from the original query, which is independent of the first-pass retrieval and avoids this potential noise.

Query2Doc [86] is an LLM-based query rewriter, which prompts the LLM to generate a relevant passage (which, to some extent, amounts to generating an answer) according to the query. Subsequently, the original query is expanded by incorporating the generated passage. As shown in Figure 3, the original query written by the user is “when was Pokemon Green released”. The LLM generates a relevant passage for the original query by employing a few-shot prompting paradigm. Then, the original query and the generated passage are combined to construct a new query. The retriever module uses this new query to retrieve a list of relevant documents. Notably, the generated passage contains additional detailed information, such as “Pokemon Green was released in Japan on February 27th”, which effectively mitigates the vocabulary mismatch issue to some extent.
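The Query2Doc recipe can be sketched roughly as follows; complete stands for an arbitrary LLM completion call, and the query-repetition factor is an assumption modeled on the paper’s idea of balancing the short query against the long generated passage.

# Rough sketch of Query2Doc-style expansion (Python; `complete` is a placeholder
# for any LLM completion API, not a specific library call).
def query2doc_expand(query, demonstrations, complete, repeat_query=5):
    # Build a few-shot prompt from (query, passage) demonstration pairs.
    prompt = ""
    for demo_query, demo_passage in demonstrations:
        prompt += f"Query: {demo_query}\nPassage: {demo_passage}\n\n"
    prompt += f"Query: {query}\nPassage:"
    passage = complete(prompt).strip()
    # Repeat the original query so that it is not drowned out by the long
    # generated passage before handing the expanded query to the retriever.
    return " ".join([query] * repeat_query + [passage])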

3.1.2 Conversational Search

Conversational search, on the other hand, involves a dynamic interaction between users and the retrieval system, where the system responds to user queries and clarifies the user’s information needs through a series of back-and-forth exchanges. In this context, query rewriting serves the purpose of refining and adapting the system’s responses based on the evolving conversation. The system may need to rewrite the user’s queries or generate new queries to retrieve relevant information that aligns with the ongoing conversation.

In the era of LLMs, leveraging LLMs in conversational search tasks offers several advantages. First, LLMs possess strong contextual understanding capabilities, enabling them to better comprehend users’ search intent within the context of multi-turn conversations between users and the system. Second, LLMs exhibit powerful generation abilities, allowing them to simulate dialogues between users and the system, thereby facilitating more robust search intent modeling.

To leverage the contextual understanding and generation capabilities of LLMs in conversational search, a framework called LLMCS (“Large Language Models know your Contextual Search intent”) has been proposed [91]. As shown in Figure 4, LLMCS prompts LLMs to generate query rewrites and longer hypothetical system responses from multiple perspectives. These generated outputs are then aggregated into an integrated representation that robustly captures the user’s complete search intent. Experimental results demonstrate that incorporating hypothetical responses alongside concise query rewrites significantly enhances search performance by explicitly supplementing more plausible search intent.
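The aggregation idea can be illustrated with the rough sketch below, which samples several rewrites and hypothetical responses, embeds each, and averages the embeddings into a single search-intent vector; the complete and encode callables and the averaging scheme are simplifying assumptions rather than the paper’s exact implementation.

import numpy as np

def integrated_intent(conversation_context, complete, encode, n_samples=5):
    # Sample several query rewrites and hypothetical system responses.
    texts = []
    for _ in range(n_samples):
        rewrite = complete(
            f"{conversation_context}\nRewrite the last question as a standalone search query:"
        )
        response = complete(
            f"{conversation_context}\nWrite a plausible system response to the last question:"
        )
        texts.append(rewrite.strip() + " " + response.strip())
    # Aggregate the sampled intents into one dense representation for retrieval.
    vectors = np.stack([encode(t) for t in texts])
    return vectors.mean(axis=0)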

3.2 Data Resources

Query rewriting methods commonly require supplementary corpora to enrich the original queries. LLMs inherently incorporate world knowledge through their parameters. However, such general knowledge might prove inadequate for specific domains, thus necessitating domain-specific corpora to provide more comprehensive and domain-specific information. In this section, we will analyze two approaches: LLM-only methods, which rely solely on the pre-existing knowledge within the model, and corpus-enhanced LLM-based methods, which leverage domain-specific corpora to augment the LLM’s capabilities.

3.2.1 Inherent Knowledge of LLMs

LLMs are capable of storing knowledge within their parameters, making it a natural choice to capitalize on this knowledge for the purpose of query rewriting. As a pioneering work in LLM-based query rewriting, HyDE [92] uses an LLM to generate a hypothetical document according to the given query and then uses a dense retriever to retrieve documents from the corpus that are relevant to the generated document. Query2doc [86] generates pseudo documents by prompting LLMs with few-shot demonstrations, and then expands the query with the generated pseudo document. Furthermore, the influence of different prompting methods and various model sizes on query rewriting has also been investigated [93]. To better accommodate the frozen retriever and the LLM-based reader, a small language model is employed as the rewriter and trained using reinforcement learning techniques with rewards provided by the LLM-based reader [94]. It is worth noting that all of these works rely on the knowledge stored in LLMs rather than additional corpora. Despite this, experimental results demonstrate significant improvements compared to traditional query rewriting methods such as RM3 [87].
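As a concrete illustration of the HyDE idea, the sketch below generates a hypothetical document and retrieves real documents by embedding similarity; complete and encode are placeholder callables for an LLM and a dense encoder, and the corpus is assumed to be pre-embedded into a matrix.

import numpy as np

def hyde_retrieve(query, complete, encode, doc_embeddings, doc_ids, k=10):
    # Step 1: let the LLM produce a plausible answer document for the query.
    hypothetical_doc = complete(f"Write a passage that answers the question: {query}")
    # Step 2: embed the generated document (not the raw query) with the dense encoder.
    query_vector = encode(hypothetical_doc)
    # Step 3: score all pre-embedded corpus documents by inner product and take the top-k.
    scores = doc_embeddings @ query_vector
    top_indices = np.argsort(-scores)[:k]
    return [doc_ids[i] for i in top_indices]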

3.2.2 Inherent Knowledge of LLMs and Document Corpus

Although LLMs exhibit remarkable capabilities, their lack of familiarity with specific domains can lead to the generation of hallucinatory or irrelevant queries. To address this issue, recent studies [95–98] have proposed hybrid approaches that enhance LLM-based query rewriting methods with a document corpus.

Why incorporate a document corpus? The integration of a document corpus offers several notable advantages. First, it provides domain-specific knowledge by fine-tuning the query generation process for specific subject areas, enabling a targeted and specialized approach to IR. For instance, LLMs may not possess immediate comprehension of the query “ADHD”, but the examination of retrieved documents can offer valuable insights indicating its association with a medical condition referred to as “Attention-Deficit/Hyperactivity Disorder”. Second, it ensures the grounding of queries in reliable and verifiable knowledge by incorporating factual information extracted from the corpus. Third, it incorporates up-to-date concepts by enriching queries with contemporary information that surpasses the knowledge contained within LLMs, ensuring enhanced information richness and timeliness. Finally, it supports relevance by utilizing relevant documents as a supplementary resource to refine the query generation process, effectively reducing the generation of irrelevant information and improving the production of contextually relevant outputs.

How to incorporate a document corpus? In light of these advantages, various paradigms have been proposed to incorporate a document corpus into LLM-based query rewriting, which we summarize as follows.

• Late fusion of LLM-only rewriting and traditional PRF retrieval results. Traditional PRF methods leverage relevant documents retrieved from a document corpus to rewrite queries, which restricts the query to the information contained in the target corpus. On the contrary, LLM-only rewriting methods provide external context not present in the corpus, which is more diverse. Both approaches have the potential to independently enhance retrieval performance. Therefore, a straightforward strategy for combining them is to apply a weighted fusion to the retrieval results, which has been demonstrated to be effective [98] (a minimal fusion sketch is given after this list).

• Combining retrieved relevant documents in the prompts of LLMs. In the era of LLMs, incorporating instructions within the prompts is a flexible method for achieving specific functionalities. Several studies leverage this procedure. For the query intent classification task, QUILL [96] demonstrates that retrieval augmentation of queries provides LLMs with meaningful supportive context, which leads to better understanding. It proposes a two-stage distillation pipeline to transfer the knowledge in LLMs to small models in order to rewrite queries more efficiently. LameR [99] proposes a retrieve-rewrite-retrieve framework. It adopts BM25 as the retriever without using any annotated query-document pairs, which shows the great potential of LLMs in zero-shot retrieval. Besides, there are no strict rules about whether rewriters should appear before retrievers. InteR [97] proposes a framework supporting multi-turn interaction between search engines and LLMs. This setup enables search engines to expand queries with LLM-generated knowledge, while simultaneously permitting LLMs to refine prompt formulations with relevant documents provided by search engines.

• Enhancing factuality of generative relevance feedback (GRF) by pseudo relevance feedback (PRF). Although generated documents are often relevant and diverse, they exhibit hallucinatory characteristics. In contrast, traditionally retrieved documents are generally regarded as reliable sources of factual information. Motivated by this observation, GRM [95] proposes a novel technique known as relevance-aware sample estimation (RASE). RASE leverages relevant documents retrieved from the collection to assign weights to generated documents. In this way, GRM ensures that relevance feedback is not only diverse but also maintains a high degree of factuality.
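The late-fusion strategy from the first bullet can be sketched as a simple weighted combination of the two score lists; the function name, score dictionaries, and fusion weight are illustrative assumptions.

def fuse_scores(prf_scores, llm_scores, alpha=0.5):
    # prf_scores / llm_scores: dicts mapping doc_id -> retrieval score from the
    # PRF-based run and the LLM-only rewriting run, respectively.
    fused = {}
    for doc_id in set(prf_scores) | set(llm_scores):
        fused[doc_id] = (
            alpha * prf_scores.get(doc_id, 0.0)
            + (1.0 - alpha) * llm_scores.get(doc_id, 0.0)
        )
    # Return documents ranked by the fused score.
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)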

TABLE 1. Examples of different prompting methods in query rewriting.

3.3 Generation Methodology

There are three primary paradigms employed to leverage LLMs for the task of query rewriting. The most prevalent paradigm is known as prompting methods, in which a specific prompt or instruction is designed to guide the language model in generating the desired output. Prompting methods offer researchers flexibility and interpretability over the rewriting process. In addition to prompting methods, fine-tuning LLMs on domain-specific data and knowledge distillation are also effective methods. LLMs contain world knowledge, which is general but may be ineffective for specific domains. Fine-tuning entails training a pre-trained LLM on a specific dataset or task in order to enhance its domain-specific effectiveness. Retrieval augmentation increases the complexity of LLM inference. Knowledge distillation alleviates this problem by transferring the knowledge of LLMs to lightweight models. In the following sections, we will introduce these three methods in detail.

3.3.1 Prompting Methods

Prompting in LLMs refers to the technique of providing a specific instruction or context to guide the model’s generation of text. The form of prompts can vary, ranging from questions and incomplete sentences to specific instructions. The prompt serves as a conditioning signal and influences the language generation process of the model. Consequently, the effectiveness of prompting is dependent on the quality and design of the prompt.

The prevailing prompting strategies generally encompass three categories: zero-shot prompting, few-shot prompting, and chain-of-thought (CoT) prompting [45]. Illustrative templates for the three strategies are sketched after the following list.

• Zero-shot prompting. Zero-shot prompting involves instructing the model to generate text on a specific topic without any prior exposure to training examples in that domain or topic. The model relies on its pre-existing knowledge and language understanding to generate coherent and contextually relevant expansion terms for original queries. Experiments show that zero-shot prompting is a simple but effective method for query rewriting [93, 97–101].

• Few-shot prompting. Few-shot prompting, also known as in-context learning, involves providing the model with a limited set of examples or demonstrations related to the desired task or domain [86, 93, 100, 101]. These examples serve as a form of explicit instruction, allowing the model to adapt its language generation to the specific task or domain at hand. Query2Doc [86] prompts LLMs to write a document that answers the query, with demonstration query-document pairs drawn from ranking datasets such as MS MARCO [102] and NQ [103]. This work experiments with a single prompt. To further study the impact of different prompt designs, [93] explores eight different prompts, such as prompting LLMs to generate query expansion terms instead of entire pseudo documents, CoT prompting, and so on. Some illustrative prompts are shown in Table 1. This work conducts more extensive experiments than Query2Doc, but the results show that these prompts are not as effective as the one used in Query2Doc.

• Chain-of-thought prompting. CoT prompting [45] is a strategy that involves iterative prompting, where the model is provided with a sequence of instructions or partial outputs [93, 100]. In conversational search, query rewriting is multi-turn, meaning that the query should be refined step by step through the interaction between the search engine and users. This process naturally coincides with the CoT process. As shown in Figure 4, engineers conduct the CoT process by adding instructions during each turn, such as “Based on all previous turns, xxx”. In ad-hoc search, by contrast, query rewriting involves only a single round, so CoT can only be applied in a simple and coarse manner. For example, as shown in Table 1, researchers add “Give the rationale before answering” to the instructions to prompt LLMs to think more deeply [93].
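For reference, the following illustrative templates cover the three strategies above for ad-hoc query rewriting; the exact wording used in the cited papers differs, so these strings should be read as assumptions.

# Illustrative prompt templates (Python strings) for the three strategies.
query = "when was Pokemon Green released"

zero_shot_prompt = f"Write a passage that answers the following query: {query}"

few_shot_prompt = (
    "Write a passage that answers the given query.\n\n"
    "Query: who wrote hamlet\n"
    "Passage: Hamlet was written by William Shakespeare around the year 1600.\n\n"
    f"Query: {query}\n"
    "Passage:"
)

# Coarse CoT variant for ad-hoc search: ask for the rationale before the answer.
cot_prompt = (
    f"Answer the following query: {query}\n"
    "Give the rationale before answering."
)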

3.3.2 Fine-tuning Methods

Fine-tuning is another effective and prevalent paradigm to let LLMs adapt to the specific domain better. The process of fine-tuning typically involves taking a pre-trained LM, such as GPT-3, and further training it on a dataset that is specific to the target domain. The dataset may consist of queries, rewrites, and associated labels, or it may be generated through a combination of human expertise and data augmentation techniques. During fine-tuning, the model’s parameters are adjusted to optimize its performance on the domain-specific task.

By training the LLM on domain-specific data, it can learn domain-specific patterns, terminology, and context, which can enhance its ability to generate high-quality query rewrites. However, fine-tuning LLMs can be quite expensive. Several studies have tried to fine-tune LLMs for query rewriting tasks. A typical study fine-tunes LLMs to generate documents relevant to the given query [101], and combines the query with the generated documents to form a new query. Concretely, it fine-tunes FiD [104] (a variant of T5) with 770M and 3B parameters on downstream datasets such as NQ [103] and FEVER [105]. Experimental results show that fine-tuning the 770M FiD model with an input length of 550 tokens can be easily performed on a single Tesla V100

TABLE 2. Summary of existing LLM-enhanced query rewriting methods. “Docs” and “KD” stand for document corpus and knowledge distillation, respectively.

32GB GPU, while training the 3B T5 model requires a larger cluster consisting of eight Tesla V100 or A100 GPUs.

3.3.3 Knowledge Distillation Methods

Although LLM-based methods have demonstrated significant improvements in query rewriting tasks, their practical implementation for online deployment is hindered by the substantial latency caused by the computational requirements of LLMs. To mitigate this challenge, knowledge distillation has emerged as a prominent technique in the industry. In the QUILL [96] framework, a two-stage distillation method is proposed. This approach entails utilizing a retrieval-augmented LLM as the professor model, a vanilla LLM as the teacher model, and a lightweight BERT model as the student model. The professor model is trained on two extensive datasets, namely Orcas-I [106] and EComm [96], which are specifically curated for query intent understanding. Subsequently, a two-stage distillation process is employed to transfer knowledge from the professor model to the teacher model, followed by knowledge transfer from the teacher model to the student model. Empirical findings demonstrate that this knowledge distillation methodology surpasses the simple scaling up of model size from base to XXL, resulting in even more substantial improvements. In a recently proposed “rewrite-retrieve-read” framework [94], an LLM is first used to rewrite the queries by prompting, followed by a retrieval-augmented reading process. To improve framework effectiveness, a trainable rewriter, implemented as a small language model, is incorporated to further adapt search queries to align with both the frozen retriever and the LLM reader’s requirements. The rewriter’s refinement involves a two-step training process. Initially, supervised warm-up training is conducted using pseudo data. Subsequently, the retrieve-then-read pipeline is described as a reinforcement learning scenario, with the rewriter’s training acting as a policy model to maximize pipeline performance rewards.
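Knowledge distillation in these pipelines ultimately reduces to training a small student model to match a larger teacher’s outputs. The snippet below is a generic single-stage, soft-label distillation step in PyTorch, offered as a minimal sketch rather than QUILL’s exact two-stage procedure.

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions with a temperature, then push the student's
    # distribution toward the teacher's via KL divergence (scaled by T^2,
    # the standard correction for temperature-softened gradients).
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)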

3.4 Limitations

Experiments across various query rewrite methods consistently highlight the positive impact of increasing the number of generated documents or answers on retrieval performance [99]. While candidate documents contribute to presenting supplementary contexts of the original query, most existing works emphasize the original query’s prominence through repetition or alternative strategies [93, 97, 99].

To the best of our knowledge, there is a lack of dedicated metrics designed for evaluating query rewriting. Existing models rely heavily on downstream retrieval tasks to assess the effectiveness of query rewriters. How to directly determine whether the rewritten queries reflect human intent and effectively serve specific tasks remains an open problem.

4 RETRIEVER

In an IR system, the retriever serves as the first-pass document filter to collect broadly relevant documents for user queries. Given the enormous amounts of documents in an IR system, the retriever’s efficiency in locating relevant documents is essential for maintaining search engine performance. Meanwhile, a high recall is also important for the retriever, as the retrieved documents are then fed into the ranker to generate final results for users, which determines the ranking quality of search engines.

In recent years, retrieval models have shifted from relying on statistical algorithms [29] to neural models [3, 31]. The latter approaches exhibit superior semantic capability and excel at understanding complicated user intent. The success of neural retrievers relies on two key factors: data and model. From the data perspective, a large amount of high-quality training data is essential. This enables retrievers to acquire comprehensive knowledge and accurate matching patterns. Furthermore, the intrinsic quality of search data, i.e., issued queries and the document corpus, significantly influences retrieval performance. From the model perspective, a strongly representational neural architecture allows retrievers to effectively store and apply knowledge obtained from the training data.

Unfortunately, there are some long-term challenges that hinder the advancement of retrieval models. First, user queries are usually short and ambiguous, making it difficult to precisely understand the user’s search intents for retrievers. Second, documents typically contain lengthy content and substantial noise, posing challenges in encoding long documents and extracting relevant information for retrieval models. Additionally, the collection of human-annotated relevance labels is time-consuming and costly. It restricts the retrievers’ knowledge boundaries and their ability to generalize across different application domains. Moreover, existing model architectures, primarily built on BERT [59], exhibit inherent limitations, thereby constraining the performance potential of retrievers. Recently, LLMs have exhibited extraordinary abilities in language understanding, text generation, and reasoning. This has motivated researchers to use these abilities to tackle the aforementioned challenges and aid in developing superior retrieval models. Roughly, these studies can be categorized into two groups, i.e., (1) leveraging LLMs to generate search data, and (2) employing LLMs to enhance model architecture.

4.1 Leveraging LLMs to Generate Search Data

In light of the quality and quantity of search data, there are two prevalent perspectives on how to improve retrieval performance via LLMs. The first perspective revolves around search data refinement methods, which concentrate on reformulating input queries to precisely present user intents.

The second perspective involves training data augmentation methods, which leverage LLMs’ generation ability to enlarge the training data for dense retrieval models, particularly in zero- or few-shot scenarios.

4.1.1 Search Data Refinement

Typically, input queries consist of short sentences or keyword-based phrases that may be ambiguous and contain multiple possible user intents. Accurately determining the specific user intent is essential in such cases. Moreover, documents usually contain redundant or noisy information, which poses a challenge for retrievers to extract relevance signals between queries and documents. Leveraging the strong text understanding and generation capabilities of LLMs offers a promising solution to these challenges. As yet, research efforts in this domain primarily concentrate on employing LLMs as query rewriters, aiming to refine input queries for more precise expressions of the user’s search intent. Section 3 has provided a comprehensive overview of these studies, so this section refrains from further elaboration. In addition to query rewriting, an intriguing avenue for exploration involves using LLMs to enhance the effectiveness of retrieval by refining lengthy documents. This intriguing area remains open for further investigation and advancement.

4.1.2 Training Data Augmentation

Due to the expensive economic and time costs of human-annotated labels, a common problem in training neural retrieval models is the lack of training data. Fortunately, the excellent capability of LLMs in text generation offers a potential solution. A key research focus lies in devising strategies to leverage LLMs’ capabilities to generate pseudo-relevant signals and augment the training dataset for the retrieval task.

Why do we need data augmentation? Previous studies of neural retrieval models focused on supervised learning, namely training retrieval models using labeled data from specific domains. For example, MS MARCO [102] provides a vast repository, containing a million passages, more than 200,000 documents, and 100,000 queries with human-annotated relevance labels, which has greatly facilitated the development of supervised retrieval models. However, this paradigm inherently constrains the retriever’s generalization ability for out-of-distribution data from other domains. The application spectrum of retrieval models varies from natural question-answering to biomedical IR, and it is expensive to annotate relevance labels for data from different domains. As a result, there is an emerging need for zero-shot and few-shot learning models to address this problem [114]. A common practice to improve the models’ effectiveness in a target domain without adequate label signals is through data augmentation.

How to apply LLMs for data augmentation? In the scenario of information retrieval, it is easy to collect numerous documents. However, the challenging and costly task lies in gathering real user queries and labeling the relevant documents accordingly. Considering the strong text generation capability of LLMs, many researchers [107,108] suggest using LLM-driven processes to create pseudo queries or relevance labels

Fig. 5. Two typical frameworks for LLM-based data augmentation in the retrieval task (right), along with their prompt examples (left). Note that the methods of relevance label generation do not treat questions as inputs but regard their generation probabilities conditioned on the retrieved passages as soft relevance labels.

TABLE 3. The comparison of existing data augmentation methods powered by LLMs for training retrieval models.

based on existing collections. These approaches facilitate the construction of relevant query-document pairs, enlarging the training data for retrieval models. According to the type of generated data, there are two mainstream approaches that complement LLM-based data augmentation for retrieval models, i.e., pseudo query generation and relevance label generation. Their frameworks are visualized in Figure 5. Next, we will give an overview of the related studies.

• Pseudo query generation. Given the abundance of documents, a straightforward idea is to use LLMs for generating their corresponding pseudo queries. One such illustration is presented by InPars [107], which leverages the in-context learning capability of GPT-3. This method employs a collection of query-document pairs as demonstrations. These pairs are combined with a document and presented as input to GPT-3, which subsequently generates possible relevant queries for the given document. By combining the same demonstration with various documents, it is easy to create a vast pool of synthetic training samples and support the fine-tuning of retrievers on specific target domains. To enhance the reliability of these synthetic samples, a fine-tuned model (e.g., a monoT5-3B model fine-tuned on MS MARCO [108]) is employed to filter the generated queries. Only the top pairs with the highest estimated relevance scores are kept for training (a minimal sketch of this generate-then-filter recipe appears after this list). This “generating-then-filtering” paradigm can be conducted iteratively in a round-trip filtering manner, i.e., by first fine-tuning a retriever on the generated samples and then filtering the generated samples using this retriever. Repeating these EM-like steps until convergence can produce high-quality training sets [109]. Furthermore, by adjusting the prompt given to LLMs, they can generate queries of different types. This capability allows for a more accurate simulation of real queries with various patterns [110].

In practice, it is costly to generate a substantial number of pseudo queries through LLMs. Balancing the generation costs and the quality of generated samples has become an urgent problem. To tackle this, UDAPDR [111] is proposed, which first produces a limited set of synthetic queries using LLMs for the target domain. These high-quality examples are subsequently used as prompts for a smaller model to generate a large number of queries, thereby constructing the training set for that specific domain. It is worth noting that the aforementioned studies primarily rely on fixed LLMs with frozen parameters. Empirically, optimizing LLMs’ parameters can significantly improve their performance on downstream tasks. Unfortunately, this pursuit is impeded by the prohibitively high demand for computational resources. To overcome this obstacle, SPTAR [112] introduces a soft prompt tuning technique that only optimizes the prompts’ embedding layer during the training process. This approach allows LLMs to better adapt to the task of generating pseudo queries, striking a favorable balance between training cost and generation quality.

• Relevance label generation. In some downstream tasks of retrieval, such as question-answering, the collection of questions is also sufficient. However, the relevance labels connecting these questions with the passages of supporting evidence are very limited. In this context, leveraging the capability of LLMs for relevance label generation is a promising approach that can augment the training corpus for retrievers. A recent method, ART [113], exemplifies this approach. It first retrieves the top-relevant passages for each question. Then, it employs an LLM to produce the generation probabilities of the question conditioned on these top passages. After a normalization process, these probabilities serve as soft relevance labels for the training of the retriever. Additionally, to highlight the similarities and differences among the corresponding methods, we present a comparative result in Table 3. It compares the aforementioned methods from various perspectives, including the number of examples, the generator employed, the type of synthetic data produced, the method applied to filter synthetic data, and whether LLMs are fine-tuned. This table serves to facilitate a clearer understanding of the landscape of these methods.
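As referenced in the pseudo query generation bullet above, the following is a hedged sketch of the generate-then-filter recipe; complete stands for an LLM completion call, rerank_score for a fine-tuned relevance model such as monoT5, and the demonstration text and cut-off are illustrative assumptions.

# Generate pseudo queries for unlabeled documents, then keep only the pairs the
# filter model scores as most relevant ("generate then filter").
FEW_SHOT_DEMO = (
    "Document: The Eiffel Tower was completed in 1889 in Paris.\n"
    "Relevant query: when was the eiffel tower built\n\n"
)

def generate_pseudo_query(document, complete):
    prompt = FEW_SHOT_DEMO + f"Document: {document}\nRelevant query:"
    return complete(prompt).strip()

def build_training_pairs(documents, complete, rerank_score, keep_top=10000):
    pairs = [(generate_pseudo_query(doc, complete), doc) for doc in documents]
    pairs.sort(key=lambda pair: rerank_score(*pair), reverse=True)
    return pairs[:keep_top]  # synthetic (query, document) pairs for retriever training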

4.2 Employing LLMs to Enhance Model Architecture

Leveraging the excellent text encoding and decoding capabilities of LLMs, it is feasible to understand queries and documents with greater precision compared to earlier smaller-sized models [59]. Researchers have endeavored to utilize LLMs as the foundation for constructing advanced retrieval models. These methods can be grouped into two categories, i.e., encoder-based retrievers and generative retrievers.

4.2.1 Encoder-based Retriever

In addition to the quantity and quality of the data, the representative capability of models also greatly influences the efficacy of retrievers. Inspired by the LLM’s excellent capability to encode and comprehend natural language, some researchers [115–117] leverage LLMs as retrieval encoders and investigate the impact of model scale on retriever performance.
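To ground the idea of using a large text encoder for retrieval, the sketch below scores documents against a query with a T5-based bi-encoder via the sentence-transformers library; the specific checkpoint (a public GTR model) and the toy corpus are illustrative assumptions rather than the exact setups of the studies discussed next.

from sentence_transformers import SentenceTransformer, util

# A T5-based dual encoder published on the Hugging Face hub (used here as an example).
encoder = SentenceTransformer("sentence-transformers/gtr-t5-base")

queries = ["when was pokemon green released"]
documents = [
    "Pokemon Green was released in Japan on February 27th, 1996.",
    "The Eiffel Tower was completed in 1889 in Paris.",
]

# Encode both sides into the same vector space and rank documents by inner product.
query_embeddings = encoder.encode(queries, convert_to_tensor=True)
doc_embeddings = encoder.encode(documents, convert_to_tensor=True)
scores = util.dot_score(query_embeddings, doc_embeddings)
print(scores)  # higher score = more relevant document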

General Retriever. Since the effectiveness of retrievers primarily relies on the capability of text embedding, the evolution of text embedding models often has a significant impact on the progress of retriever development. In the era of LLMs, a pioneering effort was made by OpenAI [115]. They view adjacent text segments as positive pairs to facilitate the unsupervised pre-training of a set of text embedding models, denoted cpt-text, whose parameter counts vary from 300M to 175B. Experiments conducted on the MS MARCO [102] and BEIR [114] datasets indicate that larger model scales have the potential to yield improved performance in unsupervised learning and transfer learning for text search tasks. Nevertheless, pre-training LLMs from scratch is prohibitively expensive for most researchers. To overcome this limitation, T5-family models of smaller sizes, such as Base, Large, XL, and XXL, are used to initialize the model parameters of bi-encoders, which are then fine-tuned on retrieval datasets [116]. The experimental results confirm again that larger model sizes can lead to better performance, particularly in zero-shot settings.

Task-aware Retriever. While the aforementioned studies primarily focus on using LLMs as text embedding models for downstream retrieval tasks, retrieval performance can be greatly enhanced when task-specific instructions are integrated. For example, TART [117] devises a task-aware retrieval model that introduces a task-specific instruction before the question. This instruction includes descriptions of the task’s intent, domain, and desired retrieved unit. For instance, given that the task is question-answering, an effective prompt might be “Retrieve a Wikipedia text that answers this question. {question}”. Here, “Wikipedia” (domain) indicates the expected source of retrieved documents, “text” (unit) suggests the type of content to retrieve, and “answers this question” (intent) describes the intended relationship between the retrieved texts and the question. This approach can take advantage of the powerful language modeling capability and extensive knowledge of LLMs to precisely capture the user’s search intents across various retrieval tasks.

4.2.2 Generative Retriever

Traditional IR systems typically follow the “index-retrieval-rank” paradigm to locate relevant documents based on user queries, which has proven effective in practice. However, these systems usually consist of three separate modules: the index module, the retrieval module, and the reranking module. Therefore, optimizing these modules collectively can be challenging, potentially resulting in sub-optimal retrieval outcomes. Additionally, this paradigm demands additional space for storing pre-built indexes, further burdening storage resources. Recently, model-based generative retrieval methods [118–120] have emerged to address these challenges. These methods move away from the traditional “index-retrieval-rank” paradigm and instead use a unified model to directly generate document identifiers (i.e., Do-cIDs) relevant to the queries. In these model-based generative retrieval methods, the knowledge of document corpus is stored in the model parameters, eliminating the need for additional storage space for the index. Existing methods have explored generating document identifiers through fine-tuning and prompting of LLMs [121, 122]

Fine-tuning LLMs. Given the vast amount of world knowledge contained in LLMs, it is intuitive to leverage them for building model-based generative retrievers. DSI [121] is a typical method that fine-tunes the pre-trained T5 models on retrieval datasets. The approach involves encoding queries and decoding document identifiers directly to perform retrieval. The authors explore multiple techniques for generating document identifiers and find that constructing semantically structured identifiers yields optimal results. In this strategy, DSI applies hierarchical clustering to group documents according to their semantic embeddings and assigns a semantic DocID to each document based on its hierarchical group. To ensure that the output DocIDs are valid and represent actual documents in the corpus, DSI constructs a trie using all DocIDs and utilizes constrained beam search during the decoding process. Furthermore, this approach observes that the scaling law, which suggests that larger LMs lead to improved performance, also applies to generative retrievers.
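For illustration, the snippet below sketches how a trie over tokenized DocIDs can restrict decoding to valid identifiers: at each decoding step, only tokens returned by allowed_next_tokens would be kept in the beam. The class and the toy DocIDs are illustrative, and how the mask is hooked into a particular beam search implementation is left open.

```python
class DocIdTrie:
    """Prefix tree over tokenized DocIDs, used to restrict decoding to valid identifiers."""
    def __init__(self, docid_token_sequences):
        self.root = {}
        for seq in docid_token_sequences:
            node = self.root
            for tok in seq:
                node = node.setdefault(tok, {})

    def allowed_next_tokens(self, prefix):
        """Return the tokens that can legally follow an already-generated DocID prefix."""
        node = self.root
        for tok in prefix:
            if tok not in node:
                return []  # the prefix does not match any DocID in the corpus
            node = node[tok]
        return list(node.keys())

# Toy semantic DocIDs produced by hierarchical clustering (illustrative values).
trie = DocIdTrie([[3, 7, 2], [3, 7, 5], [8, 1, 4]])
print(trie.allowed_next_tokens([3, 7]))  # -> [2, 5]
```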

TABLE 4. The comparison of retrievers that leverage LLMs as the foundation.

Prompting LLMs. In addition to fine-tuning LLMs for retrieval, it has been found that LLMs (e.g., GPT-series models) can directly generate relevant web URLs for user queries with a few in-context demonstrations [122]. This unique capability of LLMs is believed to arise from their training exposure to various HTML resources. As a result, LLMs can naturally serve as generative retrievers that directly generate document identifiers to retrieve relevant documents for input queries. To achieve this, an LLM-URL [122] model is proposed. It utilizes the GPT-3 text-davinci-003 model to yield candidate URLs. Furthermore, it designs regular expressions to extract valid URLs from these candidates to locate the retrieved documents.
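The post-processing step can be illustrated as follows; the regular expression and the example text are assumptions for demonstration, not the exact ones used by LLM-URL.

```python
import re

# An illustrative URL pattern; the exact expression used by LLM-URL is not specified here.
URL_PATTERN = re.compile(r"""https?://[^\s"'>)]+""")

def extract_candidate_urls(llm_output: str, top_k: int = 10):
    """Pull well-formed URLs out of raw LLM output, deduplicating while preserving order."""
    seen, urls = set(), []
    for url in URL_PATTERN.findall(llm_output):
        url = url.rstrip(".,;")  # drop trailing sentence punctuation
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls[:top_k]

print(extract_candidate_urls(
    "Relevant pages: https://en.wikipedia.org/wiki/Eiffel_Tower, https://example.org/tower."))
```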

To provide a comprehensive understanding of this topic, Table 4 summarizes the common and unique characteristics of the LLM-based retrievers discussed above.

4.3 Limitations

Though some efforts have been made toward LLM-augmented retrieval, there are still many areas that require more detailed investigation. For example, a critical requirement for retrievers is fast response, while a main limitation of existing LLMs is their huge number of parameters and long inference time. Addressing this limitation of LLMs to ensure the response time of retrievers is a critical task. Moreover, even when employing LLMs to augment datasets (a context with lower inference time demands), the potential mismatch between LLM-generated texts and real user queries could impact retrieval effectiveness. Furthermore, as LLMs usually lack domain-specific knowledge, they need to be fine-tuned on task-specific datasets before being applied to downstream tasks. Therefore, developing efficient strategies to fine-tune these LLMs with numerous parameters emerges as a key concern.

5 RERANKER

Reranker, as the second-pass document filter in IR, aims to rerank a document list retrieved by the retriever (e.g., BM25) based on the query-document relevance. With respect to the application of LLMs, the existing reranking methods can be divided into three primary paradigms: fine-tuning LLMs for reranking, prompting LLMs for reranking, and utilizing LLMs for training data augmentation. These paradigms will be elaborated upon in the following sections. Recall that we will use the term document to refer to the text retrieved in general IR scenarios, including instances such as passages (e.g., passages in MS MARCO passage ranking dataset [102]).

5.1 Fine-tuning LLMs for Reranking

Fine-tuning is an important step in applying pre-trained LLMs to a reranking task. Due to the lack of awareness of ranking during pre-training, LLMs cannot appropriately measure query-document relevance or fully understand the reranking tasks. By fine-tuning LLMs on task-specific ranking datasets, such as the MS MARCO passage ranking dataset [102], which includes signals of both relevance and irrelevance, LLMs can adjust their parameters to yield better performance in the reranking tasks. In general, existing fine-tuning strategies for LLMs can be categorized into two principal approaches: (1) fine-tuning LLMs as generation models and (2) fine-tuning LLMs as ranking models.

5.1.1 Fine-tuning LLMs as Generation Models

In this field, existing studies usually formulate document ranking as a generation task and optimize LLMs with a generative loss [13, 123, 124]. Specifically, reranking models are usually fine-tuned to generate a single token, such as “true” or “false”, given the query and documents. During inference, the query-document relevance score is determined based on the logit of the generated token. For example, a T5 model can be fine-tuned to generate classification tokens for relevant or irrelevant query-document pairs [13]. At inference time, a softmax function is applied to the logits of the “true” and “false” tokens, and the relevance score is calculated as the probability of the “true” token. A following method [123] involves a multi-view learning approach based on the T5 model. This approach simultaneously considers two tasks: generating classification tokens for a given query-document pair and generating the corresponding query conditioned on the provided document. DuoT5 [124] considers a triple $(q, d_i, d_j)$ as the input of the T5 model and is fine-tuned to generate the token “true” if document $d_i$ is more relevant to query $q$ than document $d_j$, and “false” otherwise. During inference, for each document $d_i$, it enumerates all other documents $d_j$ and uses global aggregation functions to generate the relevance score $s_i$ for document $d_i$ (e.g., $s_i = \sum_{j \neq i} p_{i,j}$, where $p_{i,j}$ represents the probability of generating “true” when taking $(q, d_i, d_j)$ as the model input).
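The scoring step of such generation-based rerankers can be sketched as follows, assuming a monoT5-style checkpoint and the “Query: ... Document: ... Relevant:” input format; the checkpoint name and prompt wording are assumptions for illustration, not prescribed by the methods above.

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Assumed publicly available checkpoint fine-tuned to emit "true"/"false" for relevance.
tokenizer = T5Tokenizer.from_pretrained("castorini/monot5-base-msmarco")
model = T5ForConditionalGeneration.from_pretrained("castorini/monot5-base-msmarco")

def relevance_score(query: str, document: str) -> float:
    """Relevance = softmax probability of the 'true' token against the 'false' token."""
    prompt = f"Query: {query} Document: {document} Relevant:"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=decoder_input_ids).logits[0, 0]
    true_id = tokenizer.encode("true", add_special_tokens=False)[0]
    false_id = tokenizer.encode("false", add_special_tokens=False)[0]
    return torch.softmax(logits[[true_id, false_id]], dim=0)[0].item()
```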

5.1.2 Fine-tuning LLMs as Ranking Models

Although the methods that fine-tune LLMs as generation models outperform some strong ranking baselines, they are not optimal for reranking tasks. This stems from two primary reasons. First, a reranking model is commonly expected to yield a numerical relevance score for each query-document pair rather than text tokens. Second, compared to generation losses, it is more reasonable to optimize the reranking model using ranking losses (e.g., RankNet [125]). While some pre-trained models, such as BERT, have been explored for document reranking, the use of seq2seq-based LLMs, such as T5-3B, for reranking tasks has not yet been thoroughly investigated. Recently, RankT5 [126] has been proposed to directly calculate the relevance score for a query-document pair and optimize the ranking performance with “pairwise” or “listwise” ranking losses. An avenue for potential performance enhancement lies in substituting the base-sized T5 model with its larger-scale counterparts.
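As an example of such a ranking objective, the sketch below implements a listwise softmax cross-entropy loss over model scores, one member of the loss family mentioned above; the exact formulation used by RankT5 may differ.

```python
import torch
import torch.nn.functional as F

def listwise_softmax_ce(scores, relevance_labels):
    """Listwise softmax cross-entropy: push probability mass toward relevant documents.

    scores:           (batch, list_size) relevance scores produced by the ranking model
    relevance_labels: (batch, list_size) binary or graded relevance judgments
    """
    log_probs = F.log_softmax(scores, dim=-1)
    targets = relevance_labels / relevance_labels.sum(dim=-1, keepdim=True).clamp(min=1e-9)
    return -(targets * log_probs).sum(dim=-1).mean()

# Toy example: the relevant document is first in the first list and second in the second list.
scores = torch.tensor([[2.1, 0.3, -1.0], [0.5, 1.7, 0.2]])
labels = torch.tensor([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
print(listwise_softmax_ce(scores, labels))
```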

Fig. 6. Three types of prompting-based reranking methods: (a) pointwise methods that consist of relevance generation (upper) and query generation (lower), (b) listwise methods, and (c) pairwise methods.

5.2 Prompting LLMs for Reranking

As the size of LLMs scales up (e.g., exceeding 10 billion parameters), it becomes increasingly difficult to fine-tune the reranking model. Addressing this challenge, recent efforts have attempted to leverage the exceptional instructional capabilities of LLMs to directly enhance document reranking via prompting strategies. In general, these prompting strategies for reranking can be divided into three categories: pointwise, listwise, and pairwise methods. A comprehensive exploration of these strategies follows in the subsequent sections.

5.2.1 Pointwise Methods

The pointwise methods measure the relevance between a query and a single document, and can be categorized into two types: relevance generation [127] and query generation [128, 129]. The upper part in Figure 6 (a) shows relevance generation based on a given prompt, where LLMs output “Yes” if the document is relevant to the query and “No” otherwise. The probability that LLMs generate the token “Yes” or “No” is used to calculate the query-document relevance score s as follows:

$$ s = \begin{cases} 1 + p(\text{Yes}), & \text{if the output is Yes}, \\ 1 - p(\text{No}), & \text{if the output is No}, \end{cases} $$

where p(Yes/No) represents the generation probability of the “Yes” or “No” token.

As for the query generation, the lower part in Figure 6 (a) shows an example in which the query-document relevance score is determined by the average log-likelihood of generating the actual query tokens based on the document:

$$ s = \frac{1}{|q|} \sum_{t=1}^{|q|} \log p\left(q_t \mid q_{<t}, d, \mathcal{P}\right), $$

where |q| denotes the number of tokens in query q, q_t denotes the t-th query token, d denotes the document, and P represents the provided prompt. The documents are then reranked based on their relevance scores. Existing query generation-based methods [128] primarily rely on a handcrafted prompt (e.g., “Please write a query based on this document”), which may not be optimal. As the prompt is a key factor in instructing LLMs to perform various NLP tasks, it is important to identify the optimal prompt to obtain the best model performance. Along this line, a discrete prompt optimization method, Co-Prompt [129], has been proposed for better prompt generation in reranking tasks.
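The query-generation score above can be computed with an off-the-shelf seq2seq LM, as in the hedged sketch below; the backbone checkpoint and prompt wording are illustrative assumptions, and the negated sequence-to-sequence loss equals the average log-likelihood of the query tokens.

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# An assumed instruction-tuned backbone used only for illustration.
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base")

def query_generation_score(query: str, document: str) -> float:
    """Average log-likelihood of the query tokens conditioned on the document and a prompt."""
    prompt = f"Please write a query based on this document. Document: {document} Query:"
    enc = tokenizer(prompt, return_tensors="pt", truncation=True)
    labels = tokenizer(query, return_tensors="pt").input_ids
    with torch.no_grad():
        # The seq2seq loss is the mean negative log-likelihood over the query tokens,
        # so its negation is the average log-likelihood used as the relevance score.
        loss = model(**enc, labels=labels).loss
    return -loss.item()
```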

5.2.2 Listwise Methods

Pointwise methods calculate the query-document relevance score using the log probability of output tokens, which is usually unavailable for LLM APIs (e.g., ChatGPT and GPT-4). Recently, listwise methods [130, 131] that directly rank a list of documents have been proposed as a solution to this problem. These methods insert the query and a document list into the prompt and instruct the LLMs to output the reranked document identifiers (see Figure 6 (b)). Considering the limited input length of LLMs, these methods also employ a sliding window strategy to rerank a subset of candidate documents each time [130]. Extensive experiments based on several LLMs reveal that the GPT-4-based method achieves competitive results and even outperforms some supervised methods on some IR benchmarks.
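A sliding-window pass of this kind can be sketched as follows; the window and step sizes, and the bottom-to-top sweep that lets strong documents bubble upward, are common choices, but details of the published method may differ.

```python
def sliding_window_rerank(query, documents, llm_rank_fn, window_size=20, step=10):
    """Rerank a long candidate list with an LLM whose prompt holds at most `window_size`
    documents at a time. llm_rank_fn(query, docs) is assumed to return the same docs
    reordered by relevance (e.g., by prompting the LLM to output document identifiers).
    """
    ranked = list(documents)
    end = len(ranked)
    while end > 0:                      # sweep from the bottom of the list to the top
        start = max(0, end - window_size)
        ranked[start:end] = llm_rank_fn(query, ranked[start:end])
        end -= step
    return ranked
```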

5.2.3 Pairwise Methods

Although listwise methods have yielded promising performance, they still suffer from some weaknesses. First, according to the experimental results [130], only the GPT-4-based method is able to achieve competitive performance. When using language models with fewer parameters (e.g., FLAN-UL2 with 20B parameters), listwise methods may produce very few usable results and underperform many supervised methods. Second, the performance of listwise methods is highly sensitive to the document order in the prompt. When the document order is randomly shuffled, listwise methods perform even worse than BM25. These weaknesses may be due to the fact that existing popular LLMs are generally not pre-trained on ranking tasks.

A recent study [132] has shown that LLMs inherently have a sense of pairwise document comparison, which is much simpler than directly outputting a reranked document list. In this study, a pairwise ranking prompt (see Figure 6 (c)) is first designed to compare whether one document is more relevant than another document. Then, several ranking algorithms are devised to leverage this pairwise ranking prompt as the computation unit for reranking the whole document list. The experimental results show the state-of-the-art performance on the standard benchmarks using moderate-size LLMs (e.g., Flan-UL2 with 20B parameters), which are much smaller than GPT-4.
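One simple way to turn such pairwise judgments into a full ranking is to plug them into an ordinary comparison sort, as sketched below; the prompting function is assumed, and the original work also studies other aggregation schemes (e.g., counting wins over all pairs).

```python
from functools import cmp_to_key

def pairwise_rerank(query, documents, llm_prefers_first):
    """Rerank documents using only pairwise LLM comparisons.

    llm_prefers_first(query, d_i, d_j) is assumed to return True when the LLM judges
    d_i more relevant to the query than d_j (e.g., via a pairwise ranking prompt).
    """
    def compare(d_i, d_j):
        return -1 if llm_prefers_first(query, d_i, d_j) else 1
    return sorted(documents, key=cmp_to_key(compare))
```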

5.3 Utilizing LLMs for Training Data Augmentation

Furthermore, in the realm of reranking, researchers have explored the integration of LLMs for training data augmentation [130, 133–135]. For example, ExaRanker [133] generates explanations for retrieval datasets using GPT-3.5 and subsequently trains a seq2seq ranking model to generate relevance labels along with corresponding explanations for given query-document pairs. InPars-Light [134] is proposed as a cost-effective method to synthesize queries for documents by prompting LLMs. In contrast to InPars-Light [134], a new dataset, ChatGPT-RetrievalQA [135], is constructed by generating synthetic documents with LLMs in response to user queries. Besides, there have been attempts to distill the document ranking capability of ChatGPT into a specialized model [130]. In this approach, ChatGPT is first instructed to directly generate a ranking list of the documents. Then, the generated document list is used as the target to train a student model (i.e., DeBERTa-v3-base) with various ranking losses (e.g., RankNet [125]).

5.4 Limitations

Although recent research on utilizing LLMs for document reranking has made significant progress, it still faces some challenges. For example, considering cost and efficiency, minimizing the number of calls to LLM APIs is a problem worth studying. Besides, while existing studies mainly focus on applying LLMs to open-domain datasets (such as MS MARCO [102]) or relevance-based text ranking tasks, their adaptability to in-domain datasets [114] and non-standard ranking datasets [136] remains an area that demands more comprehensive exploration.

6 READER

With the impressive capabilities of LLMs in understanding, extracting, and processing textual data, researchers have explored expanding the scope of IR systems beyond content ranking to answer generation. In this evolution, a reader module has been introduced to generate answers based on the document corpus in IR systems. By integrating a reader module, IR systems can directly present conclusive passages to users. In this new paradigm, users only need to comprehend the answering passages rather than analyze a ranked list of documents themselves. Furthermore, by repeatedly providing documents to LLMs based on the texts they are generating, the final generated answers can potentially be more accurate and information-rich than the original retrieved lists.

A naive strategy for implementing this function is to heuristically provide LLMs with documents relevant to the user queries or the previously generated texts to support the subsequent generation. However, this passive approach limits LLMs to merely collecting documents from IR systems without active engagement. An alternative solution is to train LLMs to interact proactively with search engines. For example, LLMs can formulate their own queries instead of relying solely on user queries or generated texts for references. According to the way LLMs utilize IR systems in the reader module, we can categorize them into passive readers and active readers. Each approach has its advantages and challenges for implementing LLM-powered answer generation in IR systems.

6.1 Passive Reader

To generate answers for users, a straightforward strategy is to supply the documents retrieved by IR systems, according to the queries or previously generated texts, as inputs to LLMs for creating passages [23, 137–146]. In this way, these approaches use LLMs and IR systems in a separate manner, with LLMs functioning as passive recipients of documents from the IR systems. The strategies for utilizing LLMs within IR systems’ reader modules can be categorized into the following three groups according to the frequency of retrieving documents for LLMs.

6.1.1 Once-Retrieval Reader

To obtain useful references for LLMs to generate responses to user queries, an intuitive way is to retrieve the top documents based on the queries themselves at the beginning. For example, REALM [137] adopts this strategy by directly appending the retrieved document contents to the original queries to predict the final answers based on masked language modeling. RAG [138] follows this strategy but applies the generative language modeling paradigm. However, these two approaches only use language models with limited parameters, such as BERT and BART. Recent approaches such as REPLUG [139] and Atlas [140] have improved on them by leveraging LLMs such as GPTs and T5s for response generation. To yield better answer generation performance, these models usually fine-tune LLMs on QA tasks. However, due to limited computing resources, many methods [141, 142] choose to prompt LLMs for generation, as they can use larger LMs in this way. To better facilitate fixed LLMs in performing the reading tasks, RETA-LLM [143] breaks down the whole complex generation task into several simple modules in the reader pipeline. These modules include a query rewriting module for refining query intents, a passage extraction module for aligning reference lengths with LLM limitations, and a fact verification module for confirming the absence of fabricated information in the generated answers.

TABLE 5. The comparison of existing methods that have a passive reader module. REALM and RAG do not use LLMs, but their frameworks have been widely applied in many following approaches.

6.1.2 Periodic-Retrieval Reader

However, while generating long conclusive answers, it has been shown [23, 144] that using only the references retrieved according to the original user intents, as in once-retrieval readers, may be inadequate. For example, when providing a passage about “Barack Obama”, language models may need additional knowledge about his university, which may not be included in the results of simply searching the initial query. In short, language models may need extra references to support the subsequent generation during the generating process, where multiple retrieval processes may be required. To address this, solutions such as RETRO [23] and RALM [144] have emerged, emphasizing the periodic collection of documents based on both the original queries and the concurrently generated texts (triggering retrieval every n generated tokens). In this manner, when generating the text about the university career of Barack Obama, the LLM can receive additional documents as supplementary materials. This need for additional references highlights the necessity of multiple retrieval iterations to ensure robustness in subsequent answer generation. Notably, RETRO [23] introduces a novel approach incorporating cross-attention between the generated texts and the references within the Transformer attention calculation, as opposed to directly embedding references into the input texts of LLMs. Since it involves additional cross-attention modules in the Transformer’s structure, RETRO trains this model from scratch. However, these two approaches mainly rely on successive spans of n tokens to separate generation and retrieve documents, which may not be semantically continuous and may render the collected references noisy and unhelpful. To solve this problem, some approaches such as IRCoT [145] also explore retrieving documents for every generated sentence, which forms a more complete semantic unit.

6.1.3 Aperiodic-Retrieval Reader

In the above strategy, the retrieval systems supply documents to LLMs in a periodic manner. However, retrieving documents at a fixed frequency may mismatch the appropriate retrieval timing and can be costly. Recently, FLARE [146] has addressed this problem by automatically determining the timing of retrieval according to the probability of the generated texts. Since the probability can serve as an indicator of LLMs’ confidence during text generation [147, 148], a low probability for a generated term could suggest that LLMs require additional knowledge. Specifically, when the probability of a term falls below a predefined threshold, FLARE employs IR systems to retrieve references in accordance with the ongoing generated sentences, while removing these low-probability terms. FLARE adopts this strategy of prompting LLMs for answer generation solely based on the probabilities of the generated terms, avoiding the need for fine-tuning while still maintaining effectiveness.
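The triggering logic can be sketched as follows; the sentence-level generation interface, the threshold value, and the regeneration step are illustrative assumptions in the spirit of FLARE rather than its exact procedure.

```python
def generate_with_adaptive_retrieval(question, llm_generate, retrieve,
                                     prob_threshold=0.6, max_sentences=8):
    """Confidence-triggered retrieval: when the draft sentence contains a low-probability
    token, fetch references for that sentence and regenerate it with the added context.

    llm_generate(prompt) is assumed to return (next_sentence, token_probs);
    retrieve(text) is assumed to return a list of reference passages.
    """
    answer, references = "", []
    for _ in range(max_sentences):
        prompt = question + "\n" + "\n".join(references) + "\n" + answer
        sentence, token_probs = llm_generate(prompt)
        if not sentence:
            break
        if token_probs and min(token_probs) < prob_threshold:
            references = retrieve(sentence)   # low confidence: gather extra evidence
            sentence, _ = llm_generate(question + "\n" + "\n".join(references) + "\n" + answer)
        answer += sentence
    return answer
```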

We summarize these passive reader approaches in Table 5, considering various aspects such as the backbone language models, the insertion point for retrieved references, the timing of using retrieval models, and the tuning strategy employed for LLMs.

6.2 Active Reader

However, the passive reader-based approaches separate IR systems from generative language models. This means that LLMs can only passively utilize references provided by IR systems and are unable to engage with the IR systems interactively, in a manner akin to how humans seek information by issuing queries.

To allow LLMs to actively use search engines, Self-Ask [149] and DSP [150] employ few-shot prompts for LLMs, triggering them to issue search queries when they believe it is required. For example, in a scenario where the query is “When was the existing tallest wooden lattice tower built?”, these prompted LLMs can decide to search for “What is the existing tallest wooden lattice tower” to gather necessary references, as they find the query cannot be answered directly. Once they have acquired information about the tower, they can iteratively query IR systems for more details until they determine to generate the final answers instead of asking further questions. Notably, these methods use IR systems to construct a single reasoning chain for LLMs. MRC [151] further improves on these methods by prompting LLMs to explore multiple reasoning chains and subsequently combining all generated answers using LLMs.
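An active reader loop of this kind can be sketched as follows; the “SEARCH:”/“ANSWER:” convention and the step limit are illustrative assumptions rather than the exact prompt format of Self-Ask or DSP.

```python
def active_reader(question, llm, search, max_steps=5):
    """The LLM decides whether it still needs information, issues its own search queries,
    and finally produces the answer. llm(prompt) returns a text completion; search(query)
    returns a short string of retrieved evidence (both are assumed interfaces)."""
    context = f"Question: {question}\n"
    for _ in range(max_steps):
        action = llm(context + "Next step? Reply 'SEARCH: <query>' or 'ANSWER: <answer>'.")
        if action.startswith("SEARCH:"):
            query = action[len("SEARCH:"):].strip()
            context += f"Search: {query}\nResults: {search(query)}\n"
        else:
            return action[len("ANSWER:"):].strip()
    return llm(context + "Give the final answer now.")
```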

Instead of prompting LLMs, WebGPT [24] takes an alternative approach by training GPT-3 models to use search engines automatically. This is achieved through the application of a reinforcement learning framework, within which a simulated environment is constructed for GPT-3 models. Specifically, the WebGPT model employs special tokens to execute actions such as querying, scrolling through rankings, and quoting references on search engines. This innovative approach allows the GPT-3 model to use search engines for text generation, enhancing the reliability and real-time capability of the generated texts. Some following models [152] have extended this paradigm to the domain of Chinese question answering.

6.3 Limitations

Several IR systems applying the retrieval-augmented generation strategy, such as New Bing and LangChain, have already entered commercial use. However, this novel retrieval-augmented content generation paradigm still faces several challenges, including effective query reformulation, optimal retrieval frequency, correct document comprehension, accurate passage extraction, and effective content summarization. It is crucial to address these challenges in order to effectively realize the potential of LLMs in this paradigm.

7 FUTURE DIRECTION

In this survey, we comprehensively reviewed recent advancements in LLM-enhanced IR systems and discussed their limitations. Since the integration of LLMs into IR systems is still in its early stages, there are still many opportunities and challenges. In this section, we summarize the potential future directions in terms of the four modules in an IR system we just discussed, namely query rewriter, retriever, reranker, and reader. In addition, as evaluation has also emerged as an important aspect, we will also introduce the corresponding research problems that need to be addressed in the future. Another discussion about important research topics on applying LLMs to IR can be found in a recent perspective paper [53].

7.1 Query Rewriter

LLMs have enhanced query rewriting for both ad-hoc and conversational search scenarios. Most of the existing methods rely on prompting LLMs to generate new queries. While these methods yield remarkable results, the refinement of rewriting quality and the exploration of potential application scenarios require further investigation.

• Rewriting query according to ranking performance. A typical paradigm of prompting-based methods is providing LLMs with several ground-truth rewriting cases (optional) and the task description of query rewriting. Despite LLMs being capable of identifying potential user intents of the query [153], they lack awareness of the resulting retrieval quality of the rewritten query. The absence of this connection can result in rewritten queries that seem correct yet produce unsatisfactory ranking results. Although some existing studies have used reinforcement learning to adjust the query rewriting process according to generation results [94], a substantial realm of research remains unexplored concerning the integration of ranking results.

• Improving query rewriting in conversational search. So far, efforts have primarily been devoted to improving query rewriting in ad-hoc search. In contrast, conversational search presents a broader landscape in which LLMs can contribute to query understanding. By incorporating historical interaction information, LLMs can adapt system responses based on user preferences, providing a more effective conversational experience. However, this potential has not been explored in depth. In addition, LLMs could also be used to simulate user behavior in conversational search scenarios, providing more training data, which is urgently needed in current research.

• Achieving personalized query rewriting. LLMs offer valuable contributions to personalized search through their capacity in analyzing user-specific data. In terms of query rewriting, with the excellent language comprehension ability of LLMs, it is possible to leverage them to build user profiles based on users’ search histories (e.g., issued queries, click-through behaviors, and dwell time). This empowers the achievement of personalized query rewriting for enhanced information retrieval and finally benefits personalized search or personalized recommendation.

7.2 Retriever

Leveraging LLMs to improve retrieval models has received considerable attention, promising an enhanced understanding of queries and documents for improved ranking performance. However, despite strides in this field, several challenges and limitations still need to be investigated in the future:

• Reducing the latency of LLM-based retrievers. LLMs, with their massive parameters and world knowledge, often entail high latency during inference. This delay poses a significant challenge for practical applications of LLM-based retrievers, as search engines require timely responses. To address this issue, promising research directions include transferring the capabilities of LLMs to smaller models, exploring quantization techniques for LLMs in IR tasks, and so on.

• Simulating realistic queries for data augmentation. Since the high latency of LLMs usually blocks their online application for retrieval tasks, many existing studies have leveraged LLMs to augment training data, which is insensitive to inference latency. Existing methods that leverage LLMs for data augmentation often generate queries without aligning them with real user queries, leading to noise in the training data and limiting the effectiveness of retrievers. As a consequence, exploring techniques such as reinforcement learning to enable LLMs to simulate the way that real queries are issued holds the potential for improving retrieval tasks.

• Incremental indexing for generative retrieval. As elaborated in Section 4.2.2, the emergence of LLMs has paved the way for generative retrievers to generate document identifiers for retrieval tasks. This approach encodes document indexes and knowledge into the LLM parameters. However, the static nature of LLM parameters, coupled with the expensive fine-tuning costs, poses challenges for updating document indexes in generative retrievers when new documents are added. Therefore, it is crucial to explore methods for constructing an incremental index that allows for efficient updates in LLM-based generative retrievers.

• Supporting multi-modal search. Web pages usually contain multi-modal information, including text, images, audio, and video. However, existing LLM-enhanced IR systems mainly support retrieval of text-based content. A straightforward solution is to replace the backbone with multi-modal large models, such as GPT-4 [80]. However, this undoubtedly increases the cost of deployment. A promising yet challenging direction is to combine the language understanding capability of LLMs with existing multi-modal retrieval models. In this way, LLMs can contribute their language skills to handling different types of content.

7.3 Reranker

In Section 5, we have discussed recent advanced techniques for utilizing LLMs in the reranking task. Some potential future directions in reranking are discussed as follows.

• Enhancing the online availability of LLMs. Though effective, many LLMs have a massive number of parameters, making it challenging to deploy them in online applications. Besides, many reranking methods [130, 131] rely on calling LLM APIs, incurring considerable costs. Consequently, devising effective approaches (such as distillation into small models) to enhance the online applicability of LLMs emerges as a research direction worth exploring.

• Improving personalized search. Many existing LLM-based reranking methods mainly focus on the ad-hoc reranking task. However, by incorporating user-specific information, LLMs can also improve the effectiveness of the personalized reranking task. For example, by analyzing users’ search history, LLMs can construct accurate user profiles and rerank the search results accordingly, providing personalized results with higher user satisfaction.

• Adapting to diverse ranking tasks. In addition to document reranking, there are also other ranking tasks, such as response ranking, evidence ranking, and entity ranking, which also belong to the universal information access system. Navigating LLMs toward adeptness in these diverse ranking tasks can be achieved through specialized methodologies, such as instruction tuning. Exploring this avenue holds promise as an intriguing and valuable research trajectory.

7.4 Reader

With the increasing capabilities of LLMs, the future interaction between users and IR systems will change significantly. Due to the powerful natural language processing and understanding capabilities of LLMs, the traditional search paradigm of providing ranking results is expected to be progressively replaced by the reader module synthesizing conclusive answering passages for user queries. Although such strategies have already been investigated by academia and facilitated by industry, as we stated in Section 6, there still exists much room for exploration.

• Improving the reference quality for LLMs. To support answer generation, existing approaches usually feed the retrieved documents directly to the LLMs as references. However, since a document usually covers many topics, some passages in it may be irrelevant to the user queries and can introduce noise during LLMs’ generation. Therefore, it is necessary to explore techniques for extracting relevant snippets from retrieved documents, enhancing the performance of retrieval-augmented generation.

• Improving the answer reliability of LLMs. Incorporating retrieved references has significantly alleviated the hallucination problem of LLMs. However, it remains uncertain whether the LLMs actually refer to these supporting materials when answering queries. Some studies [154] have revealed that LLMs can still provide unfaithful answers even with additional references. Therefore, the reliability of the conclusive answers might be lower compared to the ranking results provided by traditional IR systems. It is essential to investigate the influence of these references on the generation process, thereby improving the credibility of reader-based novel IR systems.

7.5 Evaluation

LLMs have attracted significant attention in the field of IR due to their strong ability in context understanding and text generation. To validate the effectiveness of LLM-enhanced IR approaches, it is crucial to develop appropriate evaluation metrics. Given the growing significance of readers as integral components of IR systems, the evaluation should consider two aspects: assessing ranking performance and evaluating generation performance.

• Generation-oriented ranking evaluation. Traditional evaluation metrics for ranking primarily focus on comparing the retrieval results of IR models with ground-truth (relevance) labels. Typical metrics include precision, recall, mean reciprocal rank (MRR) [155], mean average precision (MAP), and normalized discounted cumulative gain (nDCG) [156]. These metrics measure the alignment between ranking results and human preferences over the results. Nevertheless, they may fall short in capturing a document’s role in the generation of passages or answers, as its relevance to the query alone might not adequately reflect its usefulness for generation; this usefulness could be leveraged to evaluate documents more comprehensively. A formal and rigorous ranking evaluation metric that centers on generation quality has yet to be defined.

• Text generation evaluation. The wide application of LLMs in IR has led to a notable enhancement of their generation capability. Consequently, there is an imperative demand for novel evaluation strategies to effectively evaluate the performance of passage or answer generation. Previous evaluation metrics for text generation have several limitations, including: (1) Dependency on lexical matching: methods such as BLEU [157] or ROUGE [158] primarily evaluate the quality of generated outputs based on n-gram matching. This approach cannot account for lexical diversity and contextual semantics. As a result, models may favor generating common phrases or sentence structures rather than producing creative and novel content. (2) Insensitivity to subtle differences: existing evaluation methods may be insensitive to subtle differences in generated outputs. For example, if a generated output has minor semantic differences from the reference answer but is otherwise similar, traditional methods might overlook these nuanced distinctions. (3) Lack of ability to evaluate factuality: LLMs are prone to the “hallucination” problem [159–161]. The hallucinated texts can closely resemble the oracle texts in terms of vocabulary usage, sentence structures, and patterns, while containing non-factual content. Existing methods can hardly identify such problems, while the incorporation of additional knowledge sources such as knowledge bases or reference texts could potentially aid in addressing this challenge.

8 CONCLUSION

In this survey, we have conducted a thorough exploration of the transformative impact of LLMs on IR across various dimensions. We have organized existing approaches into distinct categories based on their functions: query rewriting, retrieval, reranking, and reader modules. In the domain of query rewriting, LLMs have demonstrated their effectiveness in understanding ambiguous or multi-faceted queries, enhancing the accuracy of intent identification. In the context of retrieval, LLMs have improved retrieval accuracy by enabling more nuanced matching between queries and documents, considering context as well. Within the reranking realm, LLM-enhanced models consider more fine-grained linguistic nuances when re-ordering results. The incorporation of reader modules in IR systems represents a significant step towards generating comprehensive responses instead of mere document lists. The integration of LLMs into IR systems has brought about a fundamental change in how users engage with information and knowledge. From query rewriting to retrieval, reranking, and reader modules, LLMs have enriched each aspect of the IR process with advanced linguistic comprehension, semantic representation, and context-sensitive handling. As this field continues to progress, the journey of LLMs in IR portends a future characterized by more personalized, precise, and user-centric search encounters.

This survey focuses on reviewing recent studies of applying LLMs to different information retrieval components. Beyond this, a more significant problem brought by the appearance of LLMs is: is the conventional IR framework necessary in the era of LLMs? For example, traditional IR aims to return a ranking list of documents that are relevant to issued queries. However, the development of generative language models has introduced a novel paradigm: the direct generation of answers to input questions. Furthermore, according to a recent perspective paper [53], IR might evolve into a fundamental service for diverse systems. For example, in a multi-agent simulation system [162], an IR component can be used for memory recall. This implies that there will be many new challenges in future IR.

REFERENCES

[1]     Y. Wu, W. Wu, C. Xing, M. Zhou, and Z. Li, “Sequential matching network: A new architecture for multi-turn response selection in retrieval-based chatbots,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 – August 4, Volume 1: Long Papers, R. Barzilay and M. Kan, Eds. Association for Computational Linguistics, 2017, pp. 496–505.

[2]     H. Shum, X. He, and D. Li, “From eliza to xiaoice: challenges and opportunities with social chatbots,” Frontiers Inf. Technol. Electron. Eng., vol. 19, no. 1, pp. 10–26, 2018.

[3]      V. Karpukhin, B. Oguz, S. Min, P. S. H. Lewis, L. Wu, S. Edunov, D. Chen, and W. Yih, “Dense passage retrieval for open-domain question answering,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, B. Webber, T. Cohn, Y. He, and Y. Liu, Eds. Association for Computational Linguistics, 2020, pp. 6769–6781.

[4]     R. Datta, D. Joshi, J. Li, and J. Z. Wang, “Image retrieval: Ideas, influences, and trends of the new age,” ACM Comput. Surv., vol. 40, no. 2, pp. 5:1–5:60, 2008.

[5]     C. Yuan, W. Zhou, M. Li, S. Lv, F. Zhu, J. Han, and S. Hu, “Multi-hop selector network for multi-turn response selection in retrieval-based chatbots,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, K. Inui, J. Jiang, V. Ng, and X. Wan, Eds. Association for Computational Linguistics, 2019, pp. 111–120.

[6]     Y. Zhu, J. Nie, K. Zhou, P. Du, and Z. Dou, “Content selection network for document-grounded retrieval-based chatbots,” in Advances in Information Retrieval – 43rd European Conference on IR Research, ECIR 2021, Virtual Event, March 28 – April 1, 2021, Proceedings, Part I, ser. Lecture Notes in Computer Science, D. Hiemstra, M. Moens, J. Mothe, R. Perego, M. Potthast, and F. Sebastiani, Eds., vol. 12656. Springer, 2021, pp. 755–769.

[7]     Y. Zhu, J. Nie, K. Zhou, P. Du, H. Jiang, and Z. Dou, “Proactive retrieval-based chatbots based on relevant knowledge and goals,” in SIGIR ’21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, Canada, July 11-15, 2021, F. Diaz, C. Shah, T. Suel, P. Castells, R. Jones, and T. Sakai, Eds. ACM, 2021, pp. 2000–2004.

[8]     H. Qian, Z. Dou, Y. Zhu, Y. Ma, and J. Wen, “Learning implicit user profiles for personalized retrieval-based chatbot,” CoRR, vol. abs/2108.07935, 2021.

[9]     Y. Qu, Y. Ding, J. Liu, K. Liu, R. Ren, W. X. Zhao, D. Dong, H. Wu, and H. Wang, “Rocketqa: An optimized training approach to dense passage retrieval for open-domain question answering,” in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tur, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, and Y. Zhou, Eds.   Association for Computational Linguistics, 2021, pp. 5835–5847.

[10] Y. Arens, C. A. Knoblock, and W. Shen, “Query reformulation for dynamic information integration,” J. Intell. Inf. Syst., vol. 6, no. 2/3, pp. 99–130, 1996.

[11] J. Huang and E. N. Efthimiadis, “Analyzing and evaluating query reformulation strategies in web search logs,” in Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM 2009, Hong Kong, China, November 2-6, 2009, D. W. Cheung, I. Song, W. W. Chu, X. Hu, and J. Lin, Eds. ACM, 2009, pp. 77–86.

[12] R. F. Nogueira, W. Yang, K. Cho, and J. Lin, “Multi-stage document ranking with BERT,” CoRR, vol. abs/1910.14424, 2019.

[13] R. F. Nogueira, Z. Jiang, R. Pradeep, and J. Lin, “Document ranking with a pretrained sequence-to-sequence model,” in EMNLP (Findings), ser. Findings of ACL, vol. EMNLP 2020.  Association for Computational Linguistics, 2020, pp. 708–718.

[14] Y. Zhu, J. Nie, Z. Dou, Z. Ma, X. Zhang, P. Du, X. Zuo, and H. Jiang, “Contrastive learning of user behavior sequence for context-aware document ranking,” in CIKM ’21: The 30th ACM International Conference on Information and Knowledge Management, Virtual Event, Queensland, Australia, November 1 – 5, 2021, G. Demartini, G. Zuccon, J. S. Culpepper, Z. Huang, and H. Tong, Eds. ACM, 2021, pp. 2780–2791.

[15] J. Teevan, S. T. Dumais, and E. Horvitz, “Personalizing search via automated analysis of interests and activities,” in SIGIR 2005: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Salvador, Brazil, August 15-19, 2005, R. A. Baeza-Yates, N. Ziviani, G. Marchionini, A. Moffat, and J. Tait, Eds. ACM, 2005, pp. 449–456.

[16] P. N. Bennett, R. W. White, W. Chu, S. T. Dumais, P. Bailey, F. Borisyuk, and X. Cui, “Modeling the impact of short- and long-term behavior on search personalization,” in The 35th International ACM SIGIR conference on research and development in Information Retrieval, SIGIR ’12, Portland, OR, USA, August 12-16, 2012, W. R. Hersh, J. Callan, Y. Maarek, and M. Sanderson, Eds. ACM, 2012, pp. 185–194.

[17] S. Ge, Z. Dou, Z. Jiang, J. Nie, and J. Wen, “Personalizing search results using hierarchical RNN with query-aware attention,” in Proceedings of the 27th ACM International Conference on Information and Knowledge Management, CIKM 2018, Torino, Italy, October 22-26, 2018, A. Cuzzocrea, J. Allan, N. W. Paton, D. Srivastava, R. Agrawal, A. Z. Broder, M. J. Zaki, K. S. Candan, A. Labrinidis, A. Schuster, and H. Wang, Eds. ACM, 2018, pp. 347–356.

[18] Y. Zhou, Z. Dou, Y. Zhu, and J. Wen, “PSSL: self-supervised learning for personalized search with contrastive sampling,” in CIKM ’21: The 30th ACM International Conference on Information and Knowledge Management, Virtual Event, Queensland, Australia, November 1 – 5, 2021, G. Demartini, G. Zuccon, J. S. Culpepper, Z. Huang, and H. Tong, Eds. ACM, 2021, pp. 2749–2758.

[19] J. G. Carbonell and J. Goldstein, “The use of mmr, diversity-based reranking for reordering documents and producing summaries,” in SIGIR ’98: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, August 24-28 1998, Melbourne, Australia, W. B. Croft, A. Moffat, C. J. van Rijsbergen, R. Wilkinson, and J. Zobel, Eds. ACM, 1998, pp. 335–336.

[20] R. Agrawal, S. Gollapudi, A. Halverson, and S. Ieong, “Diversifying search results,” in Proceedings of the Second International Conference on Web Search and Web Data Mining, WSDM 2009, Barcelona, Spain, February 9-11, 2009, R. Baeza-Yates, P. Boldi, B. A. Ribeiro-Neto, and B. B. Cambazoglu, Eds. ACM, 2009, pp. 5–14.

[21] J. Liu, Z. Dou, X. Wang, S. Lu, and J. Wen, “DVGAN: A minimax game for search result diversification combining explicit and implicit features,” in Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, SIGIR 2020, Virtual Event, China, July 25-30, 2020, J. X. Huang, Y. Chang, X. Cheng, J. Kamps, V. Murdock, J. Wen, and Y. Liu, Eds. ACM, 2020, pp. 479–488.

[22] Z. Su, Z. Dou, Y. Zhu, X. Qin, and J. Wen, “Modeling intent graph for search result diversification,” in SIGIR ’21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, Canada, July 11-15, 2021, F. Diaz, C. Shah, T. Suel, P. Castells, R. Jones, and T. Sakai, Eds. ACM, 2021, pp. 736–746.

[23] S. Borgeaud, A. Mensch, J. Hoffmann, T. Cai, E. Rutherford, K. Millican, G. van den Driessche, J. Lespiau, B. Damoc, A. Clark, D. de Las Casas, A. Guy, J. Menick, R. Ring, T. Hennigan, S. Huang, L. Maggiore, C. Jones, A. Cassirer, A. Brock, M. Paganini, G. Irving, O. Vinyals, S. Osindero, K. Simonyan, J. W. Rae, E. Elsen, and L. Sifre, “Improving language models by retrieving from trillions of tokens,” in International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, ser. Proceedings of Machine Learning Research, K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato, Eds., vol. 162. PMLR, 2022, pp. 2206–2240.

[24] R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim, C. Hesse, S. Jain, V. Kosaraju, W. Saunders, X. Jiang, K. Cobbe, T. Eloundou, G. Krueger, K. Button, M. Knight, B. Chess, and J. Schulman, “Webgpt: Browser-assisted question-answering with human feedback,” CoRR, vol. abs/2112.09332, 2021.

[25] G. Salton and M. McGill, Introduction to Modern Information Retrieval. McGraw-Hill Book Company, 1984.

[26] G. Salton, A. Wong, and C. Yang, “A vector space model for automatic indexing,” Commun. ACM, vol. 18, no. 11, pp. 613–620, 1975.

[27] F. Song and W. B. Croft, “A general language model for information retrieval,” in Proceedings of the 1999 ACM CIKM International Conference on Information and Knowledge Management, Kansas City, Missouri, USA, November 2-6, 1999. ACM, 1999, pp. 316–321.

[28] J. Martineau and T. Finin, “Delta TFIDF: an improved feature space for sentiment analysis,” in Proceedings of the Third International Conference on Weblogs and Social Media, ICWSM 2009, San Jose, California, USA, May 17-20, 2009, E. Adar, M. Hurst, T. Finin, N. S. Glance, N. Nicolov, and B. L. Tseng, Eds.    The AAAI Press, 2009.

[29] S. E. Robertson, S. Walker, S. Jones, M. Hancock-Beaulieu, and M. Gatford, “Okapi at TREC-3,” in Proceedings of The Third Text REtrieval Conference, TREC 1994, Gaithersburg, Maryland, USA, November 2-4, 1994, ser. NIST Special Publication, D. K. Harman, Ed., vol. 500-225.         National Institute of Standards and Technology (NIST), 1994, pp. 109–126.

[30] J. Guo, Y. Fan, Q. Ai, and W. B. Croft, “A deep relevance matching model for ad-hoc retrieval,” in Proceedings of the 25th ACM International Conference on Information and Knowledge Management, CIKM 2016, Indianapolis, IN, USA, October 24-28, 2016, S. Mukhopadhyay, C. Zhai, E. Bertino, F. Crestani, J. Mostafa, J. Tang, L. Si, X. Zhou, Y. Chang, Y. Li, and P. Sondhi, Eds. ACM, 2016, pp. 55–64.

[31] L. Xiong, C. Xiong, Y. Li, K. Tang, J. Liu, P. N. Bennett, J. Ahmed, and A. Overwijk, “Approximate nearest neighbor negative contrastive learning for dense text retrieval,” in 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.

[32] J. Lin, R. F. Nogueira, and A. Yates, Pretrained Transformers for Text Ranking: BERT and Beyond, ser. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers, 2021.

[33] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsupervised multitask learners,” 2019.

[34] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, “Language models are few-shot learners,” in Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., 2020.

[35] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. Lachaux, T. Lacroix, B. Roziere, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample, “Llama: Open and efficient foundation language models,” CoRR, vol. abs/2302.13971, 2023.

[36] J. Zhang, R. Xie, Y. Hou, W. X. Zhao, L. Lin, and J. Wen, “Recommendation as instruction following: A large language model empowered recommendation approach,” CoRR, vol. abs/2305.07001, 2023.

[37] Y. Hou, J. Zhang, Z. Lin, H. Lu, R. Xie, J. J. McAuley, and W. X. Zhao, “Large language models are zero-shot rankers for recommender systems,” CoRR, vol. abs/2305.08845, 2023.

[38] Y. Xi, W. Liu, J. Lin, J. Zhu, B. Chen, R. Tang, W. Zhang, R. Zhang, and Y. Yu, “Towards open-world recommendation with knowledge augmentation from large language models,” CoRR, vol. abs/2306.10933, 2023.

[39] W. Fan, Z. Zhao, J. Li, Y. Liu, X. Mei, Y. Wang, J. Tang, and Q. Li, “Recommender systems in the era of large language models (llms),” CoRR, vol. abs/2307.02046, 2023.

[40] S. Wu, O. Irsoy, S. Lu, V. Dabravolski, M. Dredze, S. Gehrmann, P. Kambadur, D. S. Rosenberg, and G. Mann, “Bloomberggpt: A large language model for finance,” CoRR, vol. abs/2303.17564, 2023.

[41] J. Li, Y. Liu, W. Fan, X. Wei, H. Liu, J. Tang, and Q. Li, “Empowering molecule discovery for molecule-caption translation with large language models: A chatgpt perspective,” CoRR, vol. abs/2306.06615, 2023.

[42] J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou, D. Metzler, E. H. Chi, T. Hashimoto, O. Vinyals, P. Liang, J. Dean, and W. Fedus, “Emergent abilities of large language models,” Trans. Mach. Learn. Res., vol. 2022, 2022.

[43] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. F. Christiano, J. Leike, and R. Lowe, “Training language models to follow instructions with human feedback,” in NeurIPS, 2022.

[44] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le, “Fine-tuned language models are zero-shot learners,” in The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022.

[45] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” in NeurIPS, 2022.

[46] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig, “Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing,” ACM Comput. Surv., vol. 55, no. 9, pp. 195:1–195:35, 2023.

[47] X. Qiu, T. Sun, Y. Xu, Y. Shao, N. Dai, and X. Huang, “Pre-trained models for natural language processing: A survey,” CoRR, vol. abs/2003.08271, 2020.

[48] Y. Cao, S. Li, Y. Liu, Z. Yan, Y. Dai, P. S. Yu, and L. Sun, “A comprehensive survey of ai-generated content (AIGC): A history of generative AI from GAN to chatgpt,” CoRR, vol. abs/2303.04226, 2023.

[49] J. Li, T. Tang, W. X. Zhao, and J. Wen, “Pretrained language model for text generation: A survey,” in Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI 2021, Virtual Event / Montreal, Canada, 19-27 August 2021, Z. Zhou, Ed. ijcai.org, 2021, pp. 4492–4499.

[50] Q. Dong, L. Li, D. Dai, C. Zheng, Z. Wu, B. Chang, X. Sun, J. Xu, L. Li, and Z. Sui, “A survey for in-context learning,” CoRR, vol. abs/2301.00234, 2023.

[51] J. Huang and K. C. Chang, “Towards reasoning in large language models: A survey,” in Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023, A. Rogers, J. L. Boyd-Graber, and N. Okazaki, Eds. Association for Computational Linguistics, 2023, pp. 1049–1065.

[52] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, Y. Du, C. Yang, Y. Chen, Z. Chen, J. Jiang, R. Ren, Y. Li, X. Tang, Z. Liu, P. Liu, J. Nie, and J. Wen, “A survey of large language models,” CoRR, vol. abs/2303.18223, 2023.

[53] Q. Ai, T. Bai, Z. Cao, Y. Chang, J. Chen, Z. Chen, Z. Cheng, S. Dong, Z. Dou, F. Feng, S. Gao, J. Guo, X. He, Y. Lan, C. Li, Y. Liu, Z. Lyu, W. Ma, J. Ma, Z. Ren, P. Ren, Z. Wang, M. Wang, J. Wen, L. Wu, X. Xin, J. Xu, D. Yin, P. Zhang, F. Zhang, W. Zhang, M. Zhang, and X. Zhu, “Information retrieval meets large language models: A strategic report from Chinese IR community,” CoRR, vol. abs/2307.09751, 2023.

[54] X. Liu and W. B. Croft, “Statistical language modeling for information retrieval,” Annu. Rev. Inf. Sci. Technol., vol. 39, no. 1, pp. 1–31, 2005.

[55] B. Mitra and N. Craswell, “Neural models for information retrieval,” CoRR, vol. abs/1705.01509, 2017.

[56] W. X. Zhao, J. Liu, R. Ren, and J. Wen, “Dense text retrieval based on pretrained language models: A survey,” CoRR, vol. abs/2211.14876, 2022.

[57] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” J. Mach. Learn. Res., vol. 21, pp. 140:1–140:67, 2020.

[58] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, “Deep contextualized word representations,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), M. A. Walker, H. Ji, and A. Stent, Eds. Association for Computational Linguistics, 2018, pp. 2227–2237.

[59] J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio, Eds. Association for Computational Linguistics, 2019, pp. 4171–4186.

[60] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett, Eds., 2017, pp. 5998– 6008.

[61] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mo-hamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, “BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and com-prehension,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, D. Jurafsky, J. Chai, N. Schluter, and J. R. Tetreault, Eds. Association for Computational Linguistics, 2020, pp. 7871–7880.

[62] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling laws for neural language mod-els,” CoRR, vol. abs/2001.08361, 2020.

[63] A. Clark, D. de Las Casas, A. Guy, A. Mensch, M. Paganini, J. Hoffmann, B. Damoc, B. A. Hecht-man, T. Cai, S. Borgeaud, G. van den Driessche, E. Rutherford, T. Hennigan, M. J. Johnson, A. Cassirer, C. Jones, E. Buchatskaya, D. Budden, L. Sifre, S. Osin-dero, O. Vinyals, M. Ranzato, J. W. Rae, E. Elsen, K. Kavukcuoglu, and K. Simonyan, “Unified scaling laws for routed language models,” in International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, ser. Proceedings of Machine Learning Research, K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato, Eds., vol. 162. PMLR, 2022, pp. 4057–4086.

[64] L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y. Wang, J. Gao, M. Zhou, and H. Hon, “Unified language model pre-training for natural language understanding and generation,” in Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alche-Buc, E. B. Fox, and R. Garnett, Eds., 2019, pp. 13042–13054.

[65] L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, and C. Raffel, “mt5: A massively multilingual pre-trained text-to-text transformer,” in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tur, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, and Y. Zhou, Eds. Association for Computational Linguistics, 2021, pp. 483–498.

[66] V. Sanh, A. Webson, C. Raffel, S. H. Bach, L. Sutawika, Z. Alyafeai, A. Chaffin, A. Stiegler, A. Raja, M. Dey, M. S. Bari, C. Xu, U. Thakker, S. S. Sharma, E. Szczechla, T. Kim, G. Chhablani, N. V. Nayak, D. Datta, J. Chang, M. T. Jiang, H. Wang, M. Manica, S. Shen, Z. X. Yong, H. Pandey, R. Bawden, T. Wang, T. Neeraj, J. Rozen, A. Sharma, A. Santilli, T. Fevry, J. A. Fries, R. Teehan, T. L. Scao, S. Biderman, L. Gao, T. Wolf, and A. M. Rush, “Multitask prompted training enables zero-shot task generalization,” in The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022.

[67] H. Bao, L. Dong, F. Wei, W. Wang, N. Yang, X. Liu, Y. Wang, J. Gao, S. Piao, M. Zhou, and H. Hon, “Unilmv2: Pseudo-masked language models for unified language model pre-training,” in Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, ser. Proceedings of Machine Learning Research, vol. 119. PMLR, 2020, pp. 642–652.

[68] A. Zeng, X. Liu, Z. Du, Z. Wang, H. Lai, M. Ding, Z. Yang, Y. Xu, W. Zheng, X. Xia, W. L. Tam, Z. Ma, Y. Xue, J. Zhai, W. Chen, Z. Liu, P. Zhang, Y. Dong, and J. Tang, “GLM-130B: an open bilingual pretrained model,” in The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023.

[69] W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,” J. Mach. Learn. Res., vol. 23, pp. 120:1–120:39, 2022.

[70] Z. Yang, Z. Dai, Y. Yang, J. G. Carbonell, R. Salakhutdinov, and Q. V. Le, “Xlnet: Generalized autoregressive pretraining for language understanding,” in Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alche-Buc, E. B. Fox, and R. Garnett, Eds., 2019, pp. 5754–5764.

[71] S. Black, S. Biderman, E. Hallahan, Q. Anthony, L. Gao, L. Golding, H. He, C. Leahy, K. McDonell, J. Phang, M. Pieler, U. S. Prashanth, S. Purohit, L. Reynolds, J. Tow, B. Wang, and S. Weinbach, “Gpt-neox-20b: An open-source autoregressive language model,” CoRR, vol. abs/2204.06745, 2022.

[72] J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoffmann, H. F. Song, J. Aslanides, S. Henderson, R. Ring, S. Young, E. Rutherford, T. Hennigan, J. Menick, A. Cassirer, R. Powell, G. van den Driessche, L. A. Hendricks, M. Rauh, P. Huang, A. Glaese, J. Welbl, S. Dathathri, S. Huang, J. Uesato, J. Mellor, I. Higgins, A. Creswell, N. McAleese, A. Wu, E. Elsen, S. M. Jayakumar, E. Buchatskaya, D. Budden, E. Sutherland, K. Simonyan, M. Paganini, L. Sifre, L. Martens, X. L. Li, A. Kuncoro, A. Nematzadeh, E. Gribovskaya, D. Donato, A. Lazaridou, A. Mensch, J. Lespiau, M. Tsimpoukelli, N. Grigorev, D. Fritz, T. Sottiaux, M. Pajarskas, T. Pohlen, Z. Gong, D. Toyama, C. de Masson d’Autume, Y. Li, T. Terzi, V. Mikulik, I. Babuschkin, A. Clark, D. de Las Casas, A. Guy, C. Jones, J. Bradbury, M. J. Johnson, B. A. Hechtman, L. Weidinger, I. Gabriel, W. Isaac, E. Lockhart, S. Osindero, L. Rimell, C. Dyer, O. Vinyals, K. Ayoub, J. Stanway, L. Bennett, D. Hassabis, K. Kavukcuoglu, and G. Irving, “Scaling language models: Methods, analysis & insights from training gopher,” CoRR, vol. abs/2112.11446, 2021.

[73] N. Du, Y. Huang, A. M. Dai, S. Tong, D. Lepikhin, Y. Xu, M. Krikun, Y. Zhou, A. W. Yu, O. Firat, B. Zoph, L. Fedus, M. P. Bosma, Z. Zhou, T. Wang, Y. E. Wang, K. Webster, M. Pellat, K. Robinson, K. S. Meier-Hellstern, T. Duke, L. Dixon, K. Zhang, Q. V. Le, Y. Wu, Z. Chen, and C. Cui, “Glam: Efficient scaling of language models with mixture-of-experts,” in International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, ser. Proceedings of Machine Learning Research, K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato, Eds., vol. 162. PMLR, 2022, pp. 5547–5569.

[74] Y. Sun, S. Wang, S. Feng, S. Ding, C. Pang, J. Shang, J. Liu, X. Chen, Y. Zhao, Y. Lu, W. Liu, Z. Wu, W. Gong, J. Liang, Z. Shang, P. Sun, W. Liu, X. Ouyang, D. Yu, H. Tian, H. Wu, and H. Wang, “ERNIE 3.0: Large-scale knowledge enhanced pre-training for language understanding and generation,” CoRR, vol. abs/2107.02137, 2021.

[75] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. T. Diab, X. Li, X. V. Lin, T. Mihaylov, M. Ott, S. Shleifer, K. Shuster, D. Simig, P. S. Koura, A. Sridhar, T. Wang, and L. Zettlemoyer, “OPT: open pre-trained transformer language models,” CoRR, vol. abs/2205.01068, 2022.

[76] R. Thoppilan, D. D. Freitas, J. Hall, N. Shazeer, A. Kulshreshtha, H. Cheng, A. Jin, T. Bos, L. Baker, Y. Du, Y. Li, H. Lee, H. S. Zheng, A. Ghafouri, M. Menegali, Y. Huang, M. Krikun, D. Lepikhin, J. Qin, D. Chen, Y. Xu, Z. Chen, A. Roberts, M. Bosma, Y. Zhou, C. Chang, I. Krivokon, W. Rusch, M. Pickett, K. S. Meier-Hellstern, M. R. Morris, T. Doshi, R. D. Santos, T. Duke, J. Soraker, B. Zevenbergen, V. Prabhakaran, M. Diaz, B. Hutchinson, K. Olson, A. Molina, E. Hoffman-John, J. Lee, L. Aroyo, R. Rajakumar, A. Butryna, M. Lamm, V. Kuzmina, J. Fenton, A. Cohen, R. Bernstein, R. Kurzweil, B. A. y Arcas, C. Cui, M. Croak, E. H. Chi, and Q. Le, “Lamda: Language models for dialog applications,” CoRR, vol. abs/2201.08239, 2022.

[77] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, P. Schuh, K. Shi, S. Tsvyashchenko, J. Maynez, A. Rao, P. Barnes, Y. Tay, N. Shazeer, V. Prabhakaran, E. Reif, N. Du, B. Hutchinson, R. Pope, J. Bradbury, J. Austin, M. Isard, G. Gur-Ari, P. Yin, T. Duke, A. Levskaya, S. Ghemawat, S. Dev, H. Michalewski, X. Garcia, V. Misra, K. Robinson, L. Fedus, D. Zhou, D. Ippolito, D. Luan, H. Lim, B. Zoph, A. Spiridonov, R. Sepassi, D. Dohan, S. Agrawal, M. Omernick, A. M. Dai, T. S. Pillai, M. Pellat, A. Lewkowycz, E. Moreira, R. Child, O. Polozov, K. Lee, Z. Zhou, X. Wang, B. Saeta, M. Diaz, O. Firat, M. Catasta, J. Wei, K. Meier-Hellstern, D. Eck, J. Dean, S. Petrov, and N. Fiedel, “Palm: Scaling language modeling with pathways,” CoRR, vol. abs/2204.02311, 2022.

[78] T. L. Scao, A. Fan, C. Akiki, E. Pavlick, S. Ilic, D. Hesslow, R. Castagne, A. S. Luccioni, F. Yvon, M. Galle, J. Tow, A. M. Rush, S. Biderman, A. Webson, P. S. Ammanamanchi, T. Wang, B. Sagot, N. Muennighoff, A. V. del Moral, O. Ruwase, R. Bawden, S. Bekman, A. McMillan-Major, I. Beltagy, H. Nguyen, L. Saulnier, S. Tan, P. O. Suarez, V. Sanh, H. Laurençon, Y. Jernite, J. Launay, M. Mitchell, C. Raffel, A. Gokaslan, A. Simhi, A. Soroa, A. F. Aji, A. Alfassy, A. Rogers, A. K. Nitzav, C. Xu, C. Mou, C. Emezue, C. Klamm, C. Leong, D. van Strien, D. I. Adelani, and et al., “BLOOM: A 176B-parameter open-access multilingual language model,” CoRR, vol. abs/2211.05100, 2022.

[79] A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer, H. Michalewski, V. V. Ramasesh, A. Slone, C. Anil, I. Schlag, T. Gutman-Solo, Y. Wu, B. Neyshabur, G. Gur-Ari, and V. Misra, “Solving quantitative reasoning problems with language models,” in NeurIPS, 2022.

[80] OpenAI, “GPT-4 technical report,” CoRR, vol. abs/2303.08774, 2023.

[81] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de Las Casas, L. A. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. van den Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, J. W. Rae, O. Vinyals, and L. Sifre, “Training compute-optimal large language models,” CoRR, vol. abs/2203.15556, 2022.

[82] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “Lora: Low-rank adaptation of large language models,” in The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022.

[83] X. L. Li and P. Liang, “Prefix-tuning: Optimizing continuous prompts for generation,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, C. Zong, F. Xia, W. Li, and R. Navigli, Eds. Association for Computational Linguistics, 2021, pp. 4582–4597.

[84] B. Lester, R. Al-Rfou, and N. Constant, “The power of scale for parameter-efficient prompt tuning,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, M. Moens, X. Huang, L. Specia, and S. W. Yih, Eds. Association for Computational Linguistics, 2021, pp. 3045–3059.

[85] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, “Qlora: Efficient finetuning of quantized llms,” CoRR, vol. abs/2305.14314, 2023.

[86] L. Wang, N. Yang, and F. Wei, “Query2doc: Query expansion with large language models,” CoRR, vol. abs/2303.07678, 2023.

[87] N. A. Jaleel, J. Allan, W. B. Croft, F. Diaz, L. S. Larkey, X. Li, M. D. Smucker, and C. Wade, “Umass at TREC 2004: Novelty and HARD,” in Proceedings of the Thirteenth Text REtrieval Conference, TREC 2004, Gaithersburg, Maryland, USA, November 16-19, 2004, ser. NIST Special Publication, E. M. Voorhees and L. P. Buckland, Eds., vol. 500-261. National Institute of Standards and Technology (NIST), 2004.

[88] D. Metzler and W. B. Croft, “Latent concept expansion using markov random fields,” in SIGIR 2007: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Amsterdam, The Netherlands, July 23-27, 2007, W. Kraaij, A. P. de Vries, C. L. A. Clarke, N. Fuhr, and N. Kando, Eds. ACM, 2007, pp. 311–318.

[89] C. Zhai and J. D. Lafferty, “Model-based feedback in the language modeling approach to information retrieval,” in Proceedings of the 2001 ACM CIKM International Conference on Information and Knowledge Management, Atlanta, Georgia, USA, November 5-10, 2001. ACM, 2001, pp. 403–410.

[90] D. Metzler and W. B. Croft, “A markov random field model for term dependencies,” in SIGIR 2005: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Salvador, Brazil, August 15-19, 2005, R. A. Baeza-Yates, N. Ziviani, G. Marchionini, A. Moffat, and J. Tait, Eds. ACM, 2005, pp. 472–479.

[91] K. Mao, Z. Dou, H. Chen, F. Mo, and H. Qian, “Large language models know your contextual search intent: A prompting framework for conversational search,” CoRR, vol. abs/2303.06573, 2023.

[92] L. Gao, X. Ma, J. Lin, and J. Callan, “Precise zero-shot dense retrieval without relevance labels,” CoRR, vol. abs/2212.10496, 2022.

[93] R. Jagerman, H. Zhuang, Z. Qin, X. Wang, and M. Bendersky, “Query expansion by prompting large language models,” CoRR, vol. abs/2305.03653, 2023.

[94] X. Ma, Y. Gong, P. He, H. Zhao, and N. Duan, “Query rewriting for retrieval-augmented large language models,” CoRR, vol. abs/2305.14283, 2023.

[95] I. Mackie, I. Sekulic, S. Chatterjee, J. Dalton, and F. Crestani, “GRM: generative relevance modeling using relevance-aware sample estimation for document retrieval,” CoRR, vol. abs/2306.09938, 2023.

[96] K. Srinivasan, K. Raman, A. Samanta, L. Liao, L. Bertelli, and M. Bendersky, “QUILL: query intent with large language models using retrieval augmentation and multi-stage distillation,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: EMNLP 2022 – Industry Track, Abu Dhabi, UAE, December 7 – 11, 2022, Y. Li and A. Lazaridou, Eds. Association for Computational Linguistics, 2022, pp. 492–501.

[97] J. Feng, C. Tao, X. Geng, T. Shen, C. Xu, G. Long, D. Zhao, and D. Jiang, “Knowledge refinement via interaction between search engines and large language models,” CoRR, vol. abs/2305.07402, 2023.

[98] I. Mackie, S. Chatterjee, and J. Dalton, “Generative and pseudo-relevant feedback for sparse, dense and learned sparse retrieval,” CoRR, vol. abs/2305.07477, 2023.

[99] T. Shen, G. Long, X. Geng, C. Tao, T. Zhou, and D. Jiang, “Large language models are strong zero-shot retriever,” CoRR, vol. abs/2304.14233, 2023.

[100] M. Alaofi, L. Gallagher, M. Sanderson, F. Scholer, and P. Thomas, “Can generative llms create query variants for test collections? an exploratory study,” in Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2023, Taipei, Taiwan, July 23-27, 2023, H. Chen, W. E. Duh, H. Huang, M. P. Kato, J. Mothe, and B. Poblete, Eds. ACM, 2023, pp. 1869–1873.

[101] W. Yu, D. Iter, S. Wang, Y. Xu, M. Ju, S. Sanyal, C. Zhu, M. Zeng, and M. Jiang, “Generate rather than retrieve: Large language models are strong context generators,” in The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023.

[102] T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, and L. Deng, “MS MARCO: A human generated machine reading comprehension dataset,” in CoCo@NIPS, ser. CEUR Workshop Proceedings, vol. 1773. CEUR-WS.org, 2016.

[103] T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. P. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, K. Toutanova, L. Jones, M. Kelcey, M. Chang, A. M. Dai, J. Uszkoreit, Q. Le, and S. Petrov, “Natural questions: a benchmark for question answering research,” Trans. Assoc. Comput. Linguistics, vol. 7, pp. 452–466, 2019.

[104] G. Izacard and E. Grave, “Leveraging passage retrieval with generative models for open domain question answering,” in Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, EACL 2021, Online, April 19-23, 2021, P. Merlo, J. Tiedemann, and R. Tsarfaty, Eds. Association for Computational Linguistics, 2021, pp. 874–880.

[105] J. Thorne, A. Vlachos, C. Christodoulopoulos, and A. Mittal, “FEVER: a large-scale dataset for fact extraction and verification,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), M. A. Walker, H. Ji, and A. Stent, Eds. Association for Computational Linguistics, 2018, pp. 809–819.

[106] D. Alexander, W. Kusa, and A. P. de Vries, “ORCAS-I: queries annotated with intent using weak supervision,” in SIGIR ’22: The 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, July 11 – 15, 2022, E. Amigo, P. Castells, J. Gonzalo, B. Carterette, J. S. Culpepper, and G. Kazai, Eds. ACM, 2022, pp. 3057–3066.

[107] L. H. Bonifacio, H. Abonizio, M. Fadaee, and R. F. Nogueira, “Inpars: Data augmentation for information retrieval using large language models,” CoRR, vol. abs/2202.05144, 2022.

[108] V. Jeronymo, L. H. Bonifacio, H. Abonizio, M. Fadaee, R. de Alencar Lotufo, J. Zavrel, and R. F. Nogueira, “Inpars-v2: Large language models as efficient dataset generators for information retrieval,” CoRR, vol. abs/2301.01820, 2023.

[109] Z. Dai, V. Y. Zhao, J. Ma, Y. Luan, J. Ni, J. Lu, A. Bakalov, K. Guu, K. B. Hall, and M. Chang, “Promptagator: Few-shot dense retrieval from 8 examples,” in ICLR. OpenReview.net, 2023.

[110] R. Meng, Y. Liu, S. Yavuz, D. Agarwal, L. Tu, N. Yu, J. Zhang, M. Bhat, and Y. Zhou, “Augtriever: Unsupervised dense retrieval by scalable data augmentation,” 2023.

[111] J. Saad-Falcon, O. Khattab, K. Santhanam, R. Florian, M. Franz, S. Roukos, A. Sil, M. A. Sultan, and C. Potts, “UDAPDR: unsupervised domain adaptation via LLM prompting and distillation of rerankers,” CoRR, vol. abs/2303.00807, 2023.

[112] Z. Peng, X. Wu, and Y. Fang, “Soft prompt tuning for augmenting dense retrieval with large language models,” 2023.

[113] D. S. Sachan, M. Lewis, D. Yogatama, L. Zettlemoyer, J. Pineau, and M. Zaheer, “Questions are all you need to train a dense passage retriever,” Transactions of the Association for Computational Linguistics, vol. 11, pp. 600–616, 2023.

[114] N. Thakur, N. Reimers, A. Ruckle, A. Srivastava, and I. Gurevych, “BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models,” in NeurIPS Datasets and Benchmarks, 2021.

[115] A. Neelakantan, T. Xu, R. Puri, A. Radford, J. M. Han, J. Tworek, Q. Yuan, N. Tezak, J. W. Kim, C. Hallacy, J. Heidecke, P. Shyam, B. Power, T. E. Nekoul, G. Sastry, G. Krueger, D. Schnurr, F. P. Such, K. Hsu, M. Thompson, T. Khan, T. Sherbakov, J. Jang, P. Welinder, and L. Weng, “Text and code embeddings by contrastive pre-training,” CoRR, vol. abs/2201.10005, 2022.

[116] J. Ni, C. Qu, J. Lu, Z. Dai, G. H. Abrego, J. Ma, V. Y. Zhao, Y. Luan, K. B. Hall, M. Chang, and Y. Yang, “Large dual encoders are generalizable retrievers,” in EMNLP. Association for Computational Linguistics, 2022, pp. 9844–9855.

[117] A. Asai, T. Schick, P. S. H. Lewis, X. Chen, G. Izacard, S. Riedel, H. Hajishirzi, and W. Yih, “Task-aware retrieval with instructions,” in ACL (Findings). Association for Computational Linguistics, 2023, pp. 3650–3675.

[118] D. Metzler, Y. Tay, D. Bahri, and M. Najork, “Rethinking search: making domain experts out of dilettantes,” SIGIR Forum, vol. 55, no. 1, pp. 13:1–13:27, 2021.

[119] Y. Zhou, J. Yao, Z. Dou, L. Wu, and J. Wen, “Dynamic retriever: A pre-trained model-based IR system without an explicit index,” Mach. Intell. Res., vol. 20, no. 2, pp. 276–288, 2023.

[120] J. Chen, R. Zhang, J. Guo, Y. Liu, Y. Fan, and X. Cheng, “Corpusbrain: Pre-train a generative retrieval model for knowledge-intensive language tasks,” in Proceedings of the 31st ACM International Conference on Information & Knowledge Management, Atlanta, GA, USA, October 17-21, 2022, M. A. Hasan and L. Xiong, Eds. ACM, 2022, pp. 191–200.

[121] Y. Tay, V. Tran, M. Dehghani, J. Ni, D. Bahri, H. Mehta, Z. Qin, K. Hui, Z. Zhao, J. P. Gupta, T. Schuster, W. W. Cohen, and D. Metzler, “Transformer memory as a differentiable search index,” in NeurIPS, 2022.

[122] N. Ziems, W. Yu, Z. Zhang, and M. Jiang, “Large language models are built-in autoregressive search engines,” in Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023, A. Rogers, J. L. Boyd-Graber, and N. Okazaki, Eds. Association for Computational Linguistics, 2023, pp. 2666–2678.

[123] J. Ju, J. Yang, and C. Wang, “Text-to-text multi-view learning for passage re-ranking,” in SIGIR. ACM, 2021, pp. 1803–1807.

[124] R. Pradeep, R. F. Nogueira, and J. Lin, “The expando-mono-duo design pattern for text ranking with pre-trained sequence-to-sequence models,” CoRR, vol. abs/2101.05667, 2021.

[125] C. J. C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. N. Hullender, “Learning to rank using gradient descent,” in ICML, ser. ACM International Conference Proceeding Series, vol. 119. ACM, 2005, pp. 89–96.

[126] H. Zhuang, Z. Qin, R. Jagerman, K. Hui, J. Ma, J. Lu, J. Ni, X. Wang, and M. Bendersky, “Rankt5: Fine-tuning T5 for text ranking with ranking losses,” CoRR, vol. abs/2210.10634, 2022.

[127] P. Liang, R. Bommasani, T. Lee, D. Tsipras, D. Soylu, M. Yasunaga, Y. Zhang, D. Narayanan, Y. Wu, A. Kumar, B. Newman, B. Yuan, B. Yan, C. Zhang, C. Cosgrove, C. D. Manning, C. Re, D. Acosta-Navas, D. A. Hudson, E. Zelikman, E. Durmus, F. Ladhak, F. Rong, H. Ren, H. Yao, J. Wang, K. Santhanam, L. J. Orr, L. Zheng, M. Yuksekgonul, M. Suzgun, N. Kim, N. Guha, N. S. Chatterji, O. Khattab, P. Henderson, Q. Huang, R. Chi, S. M. Xie, S. Santurkar, S. Ganguli, T. Hashimoto, T. Icard, T. Zhang, V. Chaudhary, W. Wang, X. Li, Y. Mai, Y. Zhang, and Y. Koreeda, “Holistic evaluation of language models,” CoRR, vol. abs/2211.09110, 2022.

[128] D. S. Sachan, M. Lewis, M. Joshi, A. Aghajanyan, W. Yih, J. Pineau, and L. Zettlemoyer, “Improving passage retrieval with zero-shot question generation,” in EMNLP. Association for Computational Linguistics, 2022, pp. 3781–3797.

[129] S. Cho, S. Jeong, J. Seo, and J. C. Park, “Discrete prompt optimization via constrained generation for zero-shot re-ranker,” in ACL (Findings). Association for Computational Linguistics, 2023, pp. 960–971.

[130] W. Sun, L. Yan, X. Ma, P. Ren, D. Yin, and Z. Ren, “Is chatgpt good at search? investigating large language models as re-ranking agent,” CoRR, vol. abs/2304.09542, 2023.

[131] X. Ma, X. Zhang, R. Pradeep, and J. Lin, “Zero-shot listwise document reranking with a large language model,” CoRR, vol. abs/2305.02156, 2023.

[132] Z. Qin, R. Jagerman, K. Hui, H. Zhuang, J. Wu, J. Shen, T. Liu, J. Liu, D. Metzler, X. Wang et al., “Large language models are effective text rankers with pairwise ranking prompting,” CoRR, vol. abs/2306.17563, 2023.

[133] F. Ferraretto, T. Laitz, R. de Alencar Lotufo, and R. F. Nogueira, “Exaranker: Explanation-augmented neural ranker,” CoRR, vol. abs/2301.10521, 2023.

[134] L. Boytsov, P. Patel, V. Sourabh, R. Nisar, S. Kundu, R. Ramanathan, and E. Nyberg, “Inpars-light: Cost-effective unsupervised training of efficient rankers,” CoRR, vol. abs/2301.02998, 2023.

[135] A. Askari, M. Aliannejadi, E. Kanoulas, and S. Verberne, “Generating synthetic documents for cross-encoder re-rankers: A comparative study of chatgpt and human experts,” CoRR, vol. abs/2305.02320, 2023.

[136] H. Wachsmuth, S. Syed, and B. Stein, “Retrieval of the best counterargument without prior topic knowledge,” in ACL (1). Association for Computational Linguistics, 2018, pp. 241–251.

[137] K. Guu, K. Lee, Z. Tung, P. Pasupat, and M. Chang, “REALM: retrieval-augmented language model pre-training,” CoRR, vol. abs/2002.08909, 2020.

[138] P. S. H. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Kuttler, M. Lewis, W. Yih, T. Rocktaschel, S. Riedel, and D. Kiela, “Retrieval-augmented generation for knowledge-intensive NLP tasks,” in Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., 2020.

[139] W. Shi, S. Min, M. Yasunaga, M. Seo, R. James, M. Lewis, L. Zettlemoyer, and W. Yih, “REPLUG: retrieval-augmented black-box language models,” CoRR, vol. abs/2301.12652, 2023.

[140] G. Izacard, P. S. H. Lewis, M. Lomeli, L. Hosseini, F. Petroni, T. Schick, J. Dwivedi-Yu, A. Joulin, S. Riedel, and E. Grave, “Few-shot learning with retrieval augmented language models,” CoRR, vol. abs/2208.03299, 2022.

[141] A. Lazaridou, E. Gribovskaya, W. Stokowiec, and N. Grigorev, “Internet-augmented language models through few-shot prompting for open-domain question answering,” CoRR, vol. abs/2203.05115, 2022.

[142] H. He, H. Zhang, and D. Roth, “Rethinking with retrieval: Faithful large language model inference,” CoRR, vol. abs/2301.00303, 2023.

[143] J. Liu, J. Jin, Z. Wang, J. Cheng, Z. Dou, and J. Wen, “RETA-LLM: A retrieval-augmented large language model toolkit,” CoRR, vol. abs/2306.05212, 2023.

[144] O. Ram, Y. Levine, I. Dalmedigos, D. Muhlgay, A. Shashua, K. Leyton-Brown, and Y. Shoham, “In-context retrieval-augmented language models,” CoRR, vol. abs/2302.00083, 2023.

[145] H. Trivedi, N. Balasubramanian, T. Khot, and A. Sabharwal, “Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, A. Rogers, J. L. Boyd-Graber, and N. Okazaki, Eds. Association for Computational Linguistics, 2023, pp. 10014–10037.

[146] Z. Jiang, F. F. Xu, L. Gao, Z. Sun, Q. Liu, J. Dwivedi-Yu, Y. Yang, J. Callan, and G. Neubig, “Active retrieval augmented generation,” CoRR, vol. abs/2305.06983, 2023.

[147] S. Kadavath, T. Conerly, A. Askell, T. Henighan, D. Drain, E. Perez, N. Schiefer, Z. Hatfield-Dodds, N. DasSarma, E. Tran-Johnson, S. Johnston, S. E. Showk, A. Jones, N. Elhage, T. Hume, A. Chen, Y. Bai, S. Bowman, S. Fort, D. Ganguli, D. Hernandez, J. Jacobson, J. Kernion, S. Kravec, L. Lovitt, K. Ndousse, C. Olsson, S. Ringer, D. Amodei, T. Brown, J. Clark, N. Joseph, B. Mann, S. McCandlish, C. Olah, and J. Kaplan, “Language models (mostly) know what they know,” CoRR, vol. abs/2207.05221, 2022.

[148] Z. Jiang, J. Araki, H. Ding, and G. Neubig, “How can we know when language models know? on the calibration of language models for question answering,” Trans. Assoc. Comput. Linguistics, vol. 9, pp. 962–977, 2021.

[149] O. Press, M. Zhang, S. Min, L. Schmidt, N. A. Smith, and M. Lewis, “Measuring and narrowing the compositionality gap in language models,” CoRR, vol. abs/2210.03350, 2022.

[150] O. Khattab, K. Santhanam, X. L. Li, D. Hall, P. Liang, C. Potts, and M. Zaharia, “Demonstrate-search-predict: Composing retrieval and language models for knowledge-intensive NLP,” CoRR, vol. abs/2212.14024, 2022.

[151] O. Yoran, T. Wolfson, B. Bogin, U. Katz, D. Deutch, and J. Berant, “Answering questions by meta-reasoning over multiple chains of thought,” CoRR, vol. abs/2304.13007, 2023.

[152] Y. Qin, Z. Cai, D. Jin, L. Yan, S. Liang, K. Zhu, Y. Lin, X. Han, N. Ding, H. Wang, R. Xie, F. Qi, Z. Liu, M. Sun, and J. Zhou, “Webcpm: Interactive web search for Chinese long-form question answering,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, A. Rogers, J. L. Boyd-Graber, and N. Okazaki, Eds. Association for Computational Linguistics, 2023, pp. 8968–8988.

[153] S. MacAvaney, C. Macdonald, R. Murray-Smith, and I. Ounis, “Intent5: Search result diversification using causal language models,” CoRR, vol. abs/2108.04026, 2021.

[154] R. Ren, Y. Wang, Y. Qu, W. X. Zhao, J. Liu, H. Tian, H. Wu, J. Wen, and H. Wang, “Investigating the factual knowledge boundary of large language models with retrieval augmentation,” CoRR, vol. abs/2307.11019, 2023.

[155] N. Craswell, “Mean reciprocal rank,” in Encyclopedia of Database Systems, L. Liu and M. T. Ozsu, Eds. Springer US, 2009, p. 1703.

[156] K. Jarvelin and J. Kekalainen, “Cumulated gain-based evaluation of IR techniques,” ACM Trans. Inf. Syst., vol. 20, no. 4, pp. 422–446, 2002.

[157] K. Papineni, S. Roukos, T. Ward, and W. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 6-12, 2002, Philadelphia, PA, USA. ACL, 2002, pp. 311–318.

[158] C.-Y. Lin, “ROUGE: A package for automatic evaluation of summaries,” in Text Summarization Branches Out. Barcelona, Spain: Association for Computational Linguistics, Jul. 2004, pp. 74–81.

[159] P. Manakul, A. Liusie, and M. J. F. Gales, “Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models,” CoRR, vol. abs/2303.08896, 2023.

[160] H. Qian, Y. Zhu, Z. Dou, H. Gu, X. Zhang, Z. Liu, R. Lai, Z. Cao, J. Nie, and J. Wen, “Webbrain: Learning to generate factually correct articles for queries by grounding on large web corpus,” CoRR, vol. abs/2304.04358, 2023.

[161] J. Li, X. Cheng, W. X. Zhao, J. Nie, and J. Wen, “Halueval: A large-scale hallucination evaluation benchmark for large language models,” CoRR, vol. abs/2305.11747, 2023.

[162] J. S. Park, J. C. O’Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein, “Generative agents: Interactive simulacra of human behavior,” CoRR, vol. abs/2304.03442, 2023.