On the Planning, Search, and Memorization Capabilities of Large Language Models

Sep 5, 2023

Yunhao Yang

Department of Computer Science University of Texas at Austin Austin, TX 78705 yunhaoyang234@utexas.edu

Anshul Tomar

Department of Computer Science University of Texas at Austin Austin, TX 78705

anshulmanas@gmail.com

Abstract

The rapid advancement of large language models, such as the Generative Pre-trained Transformer (GPT) series, has had significant implications across various disciplines. In this study, we investigate the potential of the state-of-the-art large language model (GPT-4) for planning tasks. We explore its effectiveness in multiple planning subfields, highlighting both its strengths and limitations. Through a comprehensive examination, we identify areas where large language models excel in solving planning problems and reveal the constraints that limit their applicability. Our empirical analysis focuses on GPT-4’s performance in planning domain extraction, graph search path planning, and adversarial planning. We then propose a way of fine-tuning a domain-specific large language model to improve its Chain of Thought (CoT) capabilities for the above-mentioned tasks. The results provide valuable insights into the potential applications of large language models in the planning domain and pave the way for future research to overcome their limitations and expand their capabilities.

1        Introduction

The fast growth of large language models, such as the Generative Pretrained Transformer (GPT) series, significantly impacts various disciplines, from natural language processing and artificial intelligence to healthcare [Chintagunta et al., 2021, Nori et al., 2023, Thirunavukarasu et al., 2023], finance [Leippold, 2023, Wu et al., 2023], and beyond [Shin et al., 2020]. These models have revolutionized tasks such as machine translation, sentiment analysis, text summarization, and question-answering, enhancing human-computer interactions and enabling more efficient and accurate information retrieval. In addition, the vast amounts of data these models are trained on allow them to generate human-like responses and perform tasks that were once considered exclusive to human intelligence.

We examine the capability of the current state-of-the-art language model—GPT-4—on planning and search [OpenAI, 2023]. Despite its impressive performance in natural language processing tasks and its ability to generate human-like text, GPT-4 is not explicitly designed for executing planning or search algorithms. However, it can provide valuable insights and guidance on various planning and search techniques and domain-specific knowledge for constructing heuristics or evaluating different approaches. GPT-4’s vast knowledge base allows users to ask questions and explore diverse aspects of planning and search.

We identify which planning subfields large language models can address and where these models fall short. Because large language models have significantly affected fields such as natural language processing, we want to examine their impact on the field of planning.

This version is a project report from CS 395T Planning Search and Reasoning.

Existing works [Valmeekam et al., 2022, Huang et al., 2022b, Singh et al., 2023, Lin et al., 2023] demonstrating the capability of language models on planning are heavily focused on plan generation but lack the exploration of path search, memorization in planning, and planning in adversarial settings.

We provide a comprehensive examination of the capability of GPT-4 in the field of planning and indicate its limitations for future research. Additionally, we fine-tune an LLM on planning domain generation, graph search, and adversarial search to test whether fine-tuning improves its predictions on these tasks. Large language models are capable of addressing various planning tasks, such as providing general information on planning algorithms, generating heuristics, and discussing different planning techniques. However, they are not specifically designed to perform planning tasks directly, as their primary function is to understand and generate text. Language models have limitations in handling real-time interactive scenarios and lack the ability to learn and adapt beyond their training data. Consequently, while large language models can provide valuable insights and guidance in the realm of planning, their utility is constrained by these limitations, and they cannot fully replace specialized planning algorithms or tools designed to address specific planning problems.

We provide an empirical analysis of how GPT-4 performs on planning domain extraction, graph search path planning, and adversarial planning. We found that GPT-4 is effective at extracting key components of planning domains from textual descriptions, allowing for the generation of structured representations suitable for use in automated planning systems. In graph search, GPT-4 exhibits the capability to understand the search algorithm and find a path by following that algorithm. However, this capability degrades as the graph becomes more complicated. Moreover, we show its capability to generate heuristics for adversarial planning and its limitation in performing adversarial search algorithms. The lack of memorization during planning is a main factor that limits the ability of large language models to plan in adversarial settings.

2        Related Work

Several works have used large language models for zero-shot planning; however, their planning either assumes the planning domain is already given, or the outcomes are static. Some works [Yang et al., 2022, Ichter et al., 2022] only generate static outcomes, while LLM-Planner [Song et al., 2022] and LM-Nav [Shah et al., 2022] require prior knowledge of specific fields to define the planning domain. Existing works [Huang et al., 2022a, Yang et al., 2022] have demonstrated the capability of these models. Large language models are sources of a wide range of knowledge, including domain-specific knowledge. However, existing works have not examined in depth the planning and search capabilities of these models, especially for complex problems or adversarial settings.

In this work, we explore the capabilities of large language models on planning domain generation, graph search, planning state memorization, and adversarial planning. The work reveals some limitations of large language models, which lead to potential future directions for improving these models.

3        Preliminaries

Large Language Models (LLMs). LLMs are machine learning models designed to process and understand natural language, such as human speech and text. These models are typically large-scale neural networks, trained using massive amounts of data, often on the scale of billions of words or more, to learn patterns and structures in language.

LLMs are capable of a wide range of natural language processing tasks, such as language translation, sentiment analysis, text classification, and speech recognition. They can generate new text based on the input prompt they received and create original content such as news articles, essays, or even poetry.

One example of a large language model is OpenAI’s GPT (short for “Generative Pre-trained Transformer”) series [Brown et al., 2020, OpenAI, 2023], with GPT-4 being the most current iteration. Compared to existing LLMs, GPT-4 is also able to understand image inputs and performs better on logical reasoning. These models have demonstrated remarkable performance across a wide range of NLP tasks, revolutionizing the field of AI and enabling new applications in various domains.

Planning Domain. A planning domain refers to a formal description of a specific problem space or environment [Haslum et al., 2019]. It consists of the rules, constraints, and actions that define the structure of the problem and the ways in which it can be solved. The goal of automated planning is to find a sequence of actions that can transform the initial state of the domain into a desired goal state.

A planning domain generally consists of the following components:

  • Objects: The entities or items that exist within the domain, such as people, locations, or resources.
  • States: A description of the various conditions or configurations of the objects in the domain.
  • Actions: The operations or steps that can be taken to modify the state of the domain. Actions usually have preconditions that must be satisfied before they can be executed and effects that describe how the state changes when the action is performed.
  • Initial state: The starting configuration of the domain from which the planning process begins.
  • Goal state: The desired configuration or set of conditions that the planning process aims to achieve.

Planning Domain Definition Language (PDDL) is a formal language used to describe planning problems and domains in the field of automated planning. PDDL separates the description of a planning problem into two parts: the domain and the problem. The domain defines the general structure of the problem, including the available actions and their effects, while the problem specifies the initial state and the goal state for a particular instance of the problem.

In addition to the components of general planning domains, PDDL consists of a set of predicates, which is a set of properties or relations that describe the state of the objects in the domain.
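The components above can be sketched in a few lines of Python. This is a minimal illustration only, assuming a STRIPS-style encoding in which a state is a set of true facts; the `Action` class, the `plan` function, and the toy road-crossing facts are hypothetical names introduced here and are not the PDDL produced by GPT-4 in this paper.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    preconditions: frozenset  # facts that must hold before the action
    add_effects: frozenset    # facts the action makes true
    del_effects: frozenset    # facts the action makes false

    def applicable(self, state):
        return self.preconditions <= state

    def apply(self, state):
        return (state - self.del_effects) | self.add_effects


def plan(initial, goal, actions):
    """Breadth-first forward search; returns a shortest list of action names, or None."""
    start, goal = frozenset(initial), frozenset(goal)
    frontier = deque([(start, [])])
    visited = {start}
    while frontier:
        state, path = frontier.popleft()
        if goal <= state:  # every goal fact holds in this state
            return path
        for action in actions:
            if action.applicable(state):
                nxt = action.apply(state)
                if nxt not in visited:
                    visited.add(nxt)
                    frontier.append((nxt, path + [action.name]))
    return None


# Hypothetical toy domain, written by hand for illustration.
ACTIONS = [
    Action("press-button", frozenset({"at-curb", "light-red"}),
           frozenset({"light-green"}), frozenset({"light-red"})),
    Action("cross", frozenset({"at-curb", "light-green"}),
           frozenset({"across"}), frozenset({"at-curb"})),
]

print(plan({"at-curb", "light-red"}, {"across"}, ACTIONS))
# prints ['press-button', 'cross']
```

Here the facts play the role of ground predicates, the initial and goal sets correspond to the PDDL problem, and the action list corresponds to the PDDL domain.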

Graph Search. Graph search is a type of algorithm used to explore and navigate graphs, which are mathematical structures consisting of nodes (also called vertices) connected by edges. In graph search, the algorithm starts at a given node and systematically explores the graph by visiting its neighboring nodes in a specific order until it reaches a target node or a goal state.

The goal of graph search is to find the shortest or most efficient path between two nodes in a graph. There are several different types of graph search algorithms, including:

  • Breadth-first search (BFS): This algorithm explores all the neighbors of a node before moving on to the next level of nodes. BFS is guaranteed to find the shortest path between two nodes in an unweighted graph.
  • Depth-first search (DFS): This algorithm explores one branch of the graph as far as possible before backtracking and exploring another branch. DFS can be used to find all paths between two nodes in a graph, but it may not find the shortest path.
  • Dijkstra’s algorithm: This algorithm is used to find the shortest path between two nodes in a weighted graph. It works by assigning a tentative distance to each node and updating the distance as it explores the graph.
  • A* search: This algorithm is similar to Dijkstra’s algorithm but uses a heuristic function to guide the search toward the goal node. A* search is often used in pathfinding in video games.

Graph search algorithms can be used to solve a wide range of problems, but the choice of algorithm depends on the specific problem and the characteristics of the graph being searched.
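To make the differences between these algorithms concrete, the following sketch implements BFS, DFS, and Dijkstra’s algorithm on a small directed weighted graph. The graph `GRAPH` and all function names are hypothetical and chosen for illustration; Dijkstra’s algorithm is written under the usual assumption of non-negative edge weights.

```python
import heapq
from collections import deque

# Hypothetical directed weighted graph: node -> {neighbor: edge weight}.
GRAPH = {
    "A": {"B": 1, "C": 4},
    "B": {"C": 1, "D": 5},
    "C": {"D": 1},
    "D": {},
}


def _rebuild(parents, node):
    """Follow parent links back to the start and return the path in forward order."""
    path = []
    while node is not None:
        path.append(node)
        node = parents[node]
    return path[::-1]


def bfs_path(graph, start, goal):
    """Path with the fewest edges (weights ignored), or None."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return _rebuild(parents, goal)
        for nbr in graph[node]:
            if nbr not in parents:
                parents[nbr] = node
                queue.append(nbr)
    return None


def dfs_path(graph, start, goal, visited=None):
    """First path found by depth-first exploration; not necessarily the shortest."""
    visited = visited if visited is not None else set()
    visited.add(start)
    if start == goal:
        return [start]
    for nbr in graph[start]:
        if nbr not in visited:
            rest = dfs_path(graph, nbr, goal, visited)
            if rest is not None:
                return [start] + rest
    return None


def dijkstra_path(graph, start, goal):
    """Minimum-total-weight path and its cost, assuming non-negative weights."""
    dist = {start: 0}
    parents = {start: None}
    heap = [(0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == goal:
            return _rebuild(parents, goal), d
        if d > dist[node]:
            continue  # stale heap entry; a shorter route was already found
        for nbr, w in graph[node].items():
            nd = d + w
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                parents[nbr] = node
                heapq.heappush(heap, (nd, nbr))
    return None, float("inf")


print(bfs_path(GRAPH, "A", "D"))       # ['A', 'B', 'D']
print(dfs_path(GRAPH, "A", "D"))       # ['A', 'B', 'C', 'D']
print(dijkstra_path(GRAPH, "A", "D"))  # (['A', 'B', 'C', 'D'], 3)
```

On this graph BFS returns the path with the fewest edges, while Dijkstra’s algorithm returns a longer path with a lower total weight, which illustrates why the appropriate algorithm depends on whether edges carry weights.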

Adversarial Planning. Adversarial planning is a type of planning problem where the planner is required to generate a plan that can anticipate and react to the actions of an adversarial agent. In this type of problem, the planner must take into account the actions of the adversary and try to find a plan that maximizes the chances of success while minimizing the impact of the adversary’s actions.

Adversarial planning is commonly used in game theory, where it is used to model the strategies and actions of two or more players engaged in a game. In this context, the planner must anticipate the actions of the opponent and develop a strategy that maximizes the chances of winning.

There are several approaches to adversarial planning, including mini-max, in which the goal of the agents is to maximize their own rewards or utility while minimizing the rewards or utility of their opponents, and Monte Carlo Tree Search (MCTS), which uses a search algorithm to simulate the possible outcomes of the planner’s actions and the adversary’s responses.

Adversarial planning is a challenging problem because it requires the planner to consider not only their own objectives but also the objectives and capabilities of the adversary agents. As a result, it often involves complex decision-making and requires sophisticated algorithms and techniques.
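As a minimal sketch of the mini-max idea, the function below evaluates a game tree in which the two players alternate turns. It assumes a hypothetical encoding where inner nodes are Python lists of children and leaves are numeric payoffs for the maximizing player; this encoding is an illustration and is not taken from the experiments in this paper.

```python
def minimax(node, maximizing):
    """Return the mini-max value of a game tree given as nested lists with numeric leaves."""
    if not isinstance(node, list):
        return node  # leaf: payoff from the maximizing player's point of view
    values = [minimax(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)


# Hypothetical two-ply tree: the maximizer picks a branch, then the minimizer picks a leaf.
TREE = [[3, 5], [2, 9], [0, 7]]

print(minimax(TREE, True))  # 3
```

The maximizer chooses the first branch because the minimizer would answer the other branches with 2 or 0, which is the sense in which the planner anticipates the adversary’s best response.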

4        Planning Domain Generation

In this section, we formulate an approach to generating planning domains using the large language model. To generate the planning domain, we apply the following procedure: we send GPT a brief description of a task and transform its response into PDDL. If we have prior task knowledge, we can send it to GPT and ask it to generate PDDL from that knowledge. This approach enables task designers to obtain task knowledge in a formal representation, regardless of the prior information they have. Moreover, after generating the planning domains, the task designer can obtain a task plan by performing a simple path search. We also show the capability of GPT-4 on path search in the later sections.

We start the experiment with a daily-life task—cross the road—to examine the planning domain generation ability. We send the following input prompt to GPT-4:

Define a problem and actions for a task “cross the road at traffic light” in PDDL.

The PDDL output generated by GPT-4 is presented in Listing 1.

Listing 1: Define a problem and actions for a task “cross the road at traffic light” in PDDL

Then, we query GPT-4 to find a plan by searching through the planning domain:

The result indicates that the generated planning domain is self-contained, and we can obtain a formal representation of the plan, which solves a zero-shot planning problem.

In addition to daily-life tasks, we can ask GPT-4 to generate the planning domain for some well-known games, such as Tic-Tac-Toe in Listing 2 and chess in Listing 3. However, once the complexity of the game increases, the success rate of generating self-contained planning domains from GPT-4 decreases. A failure example is the chess game in Listing 3, which defines the wrong goal state.

Listing 2: Define a problem and a set of actions for tic-tac-toe in PDDL.

Listing 3: Define a problem and a set of actions for the chess game in PDDL.

For more empirical results, we select 100 tasks of varying complexity, ranging from board games to daily tasks to domain-specific tasks. Then, we query GPT-4 to generate planning domains for those tasks and check the correctness of the generated domains. We show the results in Table 1. As we can see, GPT-4 always generates self-contained planning domains but occasionally generates planning domains that do not match human knowledge.

Table 1: Results on Planning Domain Generation using GPT-4. A correct plan means the planning domain is self-contained and matches human knowledge. A wrong plan means the planning domain is self-contained but does not match human knowledge (e.g., chess game in Listing 3). A failed plan means the planning domain is not self-contained due to the inconsistency of predicates.

Total Tasks    Correct Plan    Wrong Plan    Failed
100            73              27            0

Table 2: Results on path search using GPT-4. Length indicates the length of the plan generated by Fast Downward using the planning domains from GPT-4. The number of tasks indicates how many tasks can be completed in this range of steps. A plan is considered correct if the plan generated by GPT-4 is identical to the plan from Fast Downward.

Length                     3     4–6    6–8    8
Number of Tasks            68    10     7      15
Number of Correct Plans    68    8      4      3

Additionally, we query GPT-4 to solve the planning problem given those generated domains. Since all the planning domains are self-contained, we also run the Fast Downward planner to find a plan and compare it with the plan generated by GPT-4. The results in Table 2 indicate that GPT-4 can find plans for simple tasks, but once the task requires more steps, GPT-4 may generate plans with missing or disordered actions.
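The trend in Table 2 is easier to see as per-column success rates. The short calculation below takes the four length columns of Table 2 in order and divides correct plans by tasks; the variable names are illustrative, and the counts are copied from the table.

```python
# Counts copied from Table 2, one entry per length column, in table order.
tasks = [68, 10, 7, 15]
correct = [68, 8, 4, 3]

# Fraction of tasks in each column for which GPT-4 matched the planner.
rates = [c / t for c, t in zip(correct, tasks)]
overall = sum(correct) / sum(tasks)

for t, c, r in zip(tasks, correct, rates):
    print(f"{c}/{t} correct = {r:.0%}")
print(f"overall: {sum(correct)}/{sum(tasks)} = {overall:.0%}")
```

The rates fall from 100% in the first column to 80%, 57%, and 20% in the later columns, for 83 correct plans out of 100 overall.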

In conclusion, large language models like GPT-4 are useful in planning domain generation. Due to the rich knowledge encoded in these models, we can use them as a knowledge source, with the caveat that they are not always reliable for complex tasks. In the path search aspect, GPT-4 can solve very simple path search problems from given planning domains. However, there is no significant advantage to using GPT-4 compared to using a traditional planner.

5        Graph Search

In this section, we examine the capability of large language models, such as GPT-4, on graph searches. The examination consists of two aspects: first, whether GPT-4 understands the well-known graph search algorithms, and second, whether GPT-4 can follow the algorithms to find the desired path. Note that we compare the outputs of GPT-4 to the outputs of the graph search algorithms. The outputs are not necessarily the optimal path. We consider GPT-4 to be accurate as long as it can generate paths following the graph search algorithms.

We collect directed weighted graphs from 5 nodes to 95 nodes with a gap of 10 (5, 15, 25, …, 85, 95 nodes). For each number of nodes, we collect 20 different graphs. An example is presented in Figure 1. For each graph, we query GPT-4 to perform three graph search algorithms, depth-first search, breadth-first search, and Dijkstra’s algorithm, respectively, to generate paths. For the example in Figure 1, we query:

Figure 1: A randomly generated directed weighted graph for examining the graph search capability of large language models.
Figure 2: The accuracy of graph searching results generated by GPT-4 on graphs with different nodes.

As this example indicates, GPT-4 is able to generate accurate paths under all three algorithms. However, as the graphs become more complicated, the accuracy of GPT-4 decreases. In Figure 2, we present how the accuracy of GPT-4 on the three algorithms decreases as the number of graph nodes increases. Therefore, we conclude that GPT-4 is only capable of simple graph search. However, this result suggests the possibility of decomposing a complex graph into simple graphs and performing graph search on each of them.
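A benchmark of this shape can be sketched as follows. This is an illustration under stated assumptions, not the procedure used to collect the graphs in this paper: the edge probability, the weight range, the seeding scheme, and the names `random_digraph` and `accuracy` are all hypothetical. Accuracy is taken here to be the fraction of model paths that exactly match the path produced by the reference algorithm.

```python
import random


def random_digraph(n_nodes, edge_prob=0.3, max_weight=10, seed=None):
    """Random directed weighted graph as node -> {neighbor: weight}; parameters are assumed."""
    rng = random.Random(seed)
    graph = {u: {} for u in range(n_nodes)}
    for u in range(n_nodes):
        for v in range(n_nodes):
            if u != v and rng.random() < edge_prob:
                graph[u][v] = rng.randint(1, max_weight)
    return graph


def accuracy(predicted_paths, reference_paths):
    """Fraction of predicted paths identical to the reference algorithm's paths."""
    matches = sum(p == r for p, r in zip(predicted_paths, reference_paths))
    return matches / len(reference_paths)


# Graph sizes 5, 15, ..., 95 with 20 graphs per size, as described in the text.
SIZES = range(5, 100, 10)
benchmark = {
    n: [random_digraph(n, seed=1000 * n + i) for i in range(20)]
    for n in SIZES
}

print(len(benchmark), "sizes,", sum(len(g) for g in benchmark.values()), "graphs")
```

Plotting `accuracy` per graph size for each algorithm would give a curve of the kind summarized in Figure 2.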

6        Adversarial Planning

In this section, we explore the capability of the large language model, specifically GPT-4, on adversarial planning. The experiment consists of two components: defining heuristics and applying adversarial search.

First, we choose the simple game Tic-Tac-Toe as an example and query GPT-4 for a proper heuristic:

The language model successfully generates a heuristic for adversarial planning in Tic-Tac-Toe. We then manually implement this heuristic and perform the Mini-max algorithm. It turns out that the output of GPT-4 is a workable heuristic.

Second, we examine the performance of GPT-4 on adversarial planning. We query GPT-4 for playing the game with the heuristic defined above:

In this example, GPT-4 fails to identify the possible moves and fails to memorize the sequence of previous states, i.e., the existing pieces on the board. Moreover, even with the misidentified moves, it fails to compute the heuristics. From this observation, we can also claim that GPT-4 is unable to understand the heuristics. Therefore, it also has limited capability in heuristic-guided graph search algorithms like A*.

Due to the failure of adversarial planning in Tic-Tac-Toe, we stopped examining its capability on more complicated tasks and derived our conclusion.

In conclusion, LLMs are capable of generating reasonable heuristics for the adversarial planning of simple games. We examine this capability in the tic-tac-toe example, where the AI-driven heuristic allows for evaluating board positions based on the presence of a player’s symbols in rows, columns, and diagonals. LLMs encode rich knowledge and can provide reasonable heuristics for some given tasks.

This simple heuristic demonstrates the potential of using AI-generated heuristics to guide and enhance decision-making in various problem domains. In conjunction with adversarial search algorithms such as Minimax or Alpha-Beta pruning, these heuristics enable the creation of AI opponents that can effectively compete against human players in simple games like Tic-Tac-Toe.
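The combination of a line-based heuristic with mini-max can be sketched as below. The heuristic shown is a hypothetical stand-in that scores a player’s symbols on rows, columns, and diagonals not yet blocked by the opponent; it is not the heuristic text returned by GPT-4. The board encoding (a 9-character string with spaces for empty cells) and the win value of 100 are also assumptions made for illustration.

```python
# The eight winning lines: three rows, three columns, two diagonals.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
         (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]


def other(player):
    return "O" if player == "X" else "X"


def winner(board):
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None


def heuristic(board, player):
    """Own symbols on unblocked lines minus the opponent's symbols on their unblocked lines."""
    opponent, score = other(player), 0
    for line in LINES:
        cells = [board[i] for i in line]
        if opponent not in cells:
            score += cells.count(player)
        if player not in cells:
            score -= cells.count(opponent)
    return score


def minimax(board, to_move, player, depth):
    """Depth-limited mini-max value of `board` from `player`'s point of view."""
    win = winner(board)
    if win is not None:
        return 100 if win == player else -100
    moves = [i for i, c in enumerate(board) if c == " "]
    if not moves or depth == 0:
        return heuristic(board, player)
    values = [minimax(board[:i] + to_move + board[i + 1:], other(to_move), player, depth - 1)
              for i in moves]
    return max(values) if to_move == player else min(values)


def best_move(board, player, depth=9):
    """Index of the empty cell with the highest mini-max value for `player`."""
    moves = [i for i, c in enumerate(board) if c == " "]
    return max(moves, key=lambda i: minimax(board[:i] + player + board[i + 1:],
                                            other(player), player, depth - 1))


print(best_move("XX OO    ", "X"))  # 2: X completes the top row
print(best_move("OO  X    ", "X"))  # 2: X blocks O's top row
```

Unlike a language model that must restate the board in text at every turn, the search here carries the full board state explicitly in each recursive call.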

However, LLMs cannot perform adversarial searching algorithms like Mini-max or Monte-Carlo Tree Search. Due to the fact that GPTs are a series of language models for next-word prediction, they can neither understand the state of the game nor search over all the possibilities (as we addressed the graph search limitation in the last section). Additionally, the language model cannot memorize the sequence of previous states correctly. These factors raise a limitation of LLMs and could potentially be a direction of improvement.

7        Fine-tuning LLM for Logical Reasoning

Given the subpar performance of LLMs on logical reasoning tasks like adversarial planning, we fine-tune our own language model to check whether we can improve its performance on logical reasoning tasks.

7.1       Dataset

The dataset comprises three parts: planning domain generation (7 different tasks), graph search (20 different tasks), and adversarial planning (4 different tasks). For example, for planning domain generation, we queried GPT-4 using seven different problem definitions. Each problem definition generated 10-100 different goal state configurations depending on the problem, resulting in a total of 540 queries. We kept only the queries that we expected an LLM to answer correctly and ran GPT-4 inference on them to obtain the soft labels for fine-tuning our own LLM. Across all tasks, we collected around 1300 pairs of queries and soft labels, where each query was appended with the name of its part (e.g., planning domain queries were appended with “planning domain :”, and so on).

7.2       Model Selection and Fine-tuning

To select the model to fine-tune, we looked for a model small enough to be fine-tuned easily with the resources available to us and large enough to make logical inferences. We chose the Flan T5 base model [Shen et al., 2023] released by Google since it meets these criteria. One of the reasons we chose this model was that its checkpoints were readily available on HuggingFace and it had a reasonable size of 240M parameters. Also, as shown in [Shen et al., 2023], the model shows SOTA performance for its number of parameters on the CoT dataset [Qingyi Si, 2023], which contains chain-of-thought data points such as arithmetic reasoning and explanation generation. For fine-tuning, we froze the weights of the original model and only changed the final layer’s weights. Updating the weights of the entire model could have led to catastrophic forgetting, or the model would not have been fine-tuned properly, since our dataset was too small.

7.3       Results

We fine-tuned our LLM on approximately 1000 data points and evaluated it on 200 held-out data points. We used the remaining 100 data points as the validation set and stopped training when the validation loss started to increase. We observed that the outputs of GPT-4 were much better than those of the Flan models, which were incorrect for nearly every data point we had. In Table 3, we compare the outputs of Flan, fine-tuned Flan, and GPT-4 for a specific case of graph search and planning domain generation. As for adversarial search, the Flan models are unable to come up with a coherent heuristic, and hence we skip their evaluations. Also, we could not compare the output of the LLM against the ground truth values after planning, since all the planning domains provided by Flan were incorrect. In conclusion, we saw minor improvements from fine-tuning, possibly because the original model had never seen prompts like these, but we were unable to process the outputs of the fine-tuned models.
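The data split and stopping rule described above can be sketched independently of any training framework. The function names, the fixed shuffle seed, and the patience of one epoch are assumptions made for illustration; the split sizes follow the approximate figures given in the text.

```python
import random


def split_dataset(examples, n_train=1000, n_test=200, n_val=100, seed=0):
    """Shuffle once, then cut into disjoint train, test, and validation parts."""
    items = list(examples)
    if len(items) < n_train + n_test + n_val:
        raise ValueError("not enough examples for the requested split")
    random.Random(seed).shuffle(items)
    train = items[:n_train]
    test = items[n_train:n_train + n_test]
    val = items[n_train + n_test:n_train + n_test + n_val]
    return train, test, val


def early_stopping_epoch(val_losses, patience=1):
    """Index of the epoch to keep: the best one seen before the loss stopped improving."""
    best_epoch, best_loss, bad_epochs = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss, bad_epochs = epoch, loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # validation loss increased: stop training
    return best_epoch


train, test, val = split_dataset(range(1300))
print(len(train), len(test), len(val))                     # 1000 200 100
print(early_stopping_epoch([0.9, 0.7, 0.6, 0.65, 0.5]))    # 2
```

With a patience of one, training stops at the first rise in validation loss, so the later drop to 0.5 in the example is never reached; a larger patience would trade extra training time for robustness to noisy validation curves.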

8        Conclusion

Large language models can play critical roles in planning due to their broad knowledge. Up-to-date large language models encode rich real-world knowledge and can make logical inferences to a certain extent. We provide examples to demonstrate that these models can generate self-consistent planning domains for given tasks without any prior information provided. This capability enables language models to do zero-shot planning. Moreover, the language models can perform graph searches on small-scale graphs, indicating their potential in search. However, the current models have limited abilities to memorize the sequence of previous states during planning and to solve search problems in complicated environments (graphs). Both limitations render large language models incapable of adversarial planning. Overall, large language models can play significant roles in planning, especially few-shot planning, and their significance can be improved over time.

Table 3: The comparisons between the outputs of various models for planning domains and graph search.

References

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.

Bharath Chintagunta, Namit Katariya, Xavier Amatriain, and Anitha Kannan. Medically aware gpt-3 as a data generator for medical dialogue summarization. In Machine Learning for Healthcare Conference, pages 354–372. PMLR, 2021.

Patrik Haslum, Nir Lipovetzky, Daniele Magazzeni, and Christian Muise. An introduction to the planning domain definition language. Synthesis Lectures on Artificial Intelligence and Machine Learning, 2019.

Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu, and Sivan Sabato, editors, International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pages 9118–9147. PMLR, 2022a. URL https://proceedings.mlr.press/v162/huang22a.html.

Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, Pierre Sermanet, Tomas Jackson, Noah Brown, Linda Luu, Sergey Levine, Karol Hausman, and Brian Ichter. Inner monologue: Embodied reasoning through planning with language models. 205:1769–1782, 2022b.

Brian Ichter, Anthony Brohan, Yevgen Chebotar, Chelsea Finn, Karol Hausman, Alexander Herzog, Daniel Ho, Julian Ibarz, Alex Irpan, Eric Jang, Ryan Julian, Dmitry Kalashnikov, Sergey Levine, Yao Lu, Carolina Parada, Kanishka Rao, Pierre Sermanet, Alexander Toshev, Vincent Vanhoucke, Fei Xia, Ted Xiao, Peng Xu, Mengyuan Yan, Noah Brown, Michael Ahn, Omar Cortes, Nicolas Sievers, Clayton Tan, Sichun Xu, Diego Reyes, Jarek Rettinghouse, Jornell Quiambao, Peter Pastor, Linda Luu, Kuang-Huei Lee, Yuheng Kuang, Sally Jesmonth, Nikhil J. Joshi, Kyle Jeffrey, Rosario Jauregui Ruano, Jasmine Hsu, Keerthana Gopalakrishnan, Byron David, Andy Zeng, and Chuyuan Kelly Fu. Do as I can, not as I say: Grounding language in robotic affordances. In Karen Liu, Dana Kulic, and Jeffrey Ichnowski, editors, Conference on Robot Learning, CoRL 2022, 14-18 December 2022, Auckland, New Zealand, volume 205 of Proceedings of Machine Learning Research, pages 287–318. PMLR, 2022. URL https://proceedings.mlr.press/v205/ichter23a.html.

Markus Leippold. Thus spoke gpt-3: Interviewing a large-language model on climate finance. Finance Research Letters, 53:103617, 2023.

Bill Yuchen Lin, Chengsong Huang, Qian Liu, Wenda Gu, Sam Sommerer, and Xiang Ren. On grounded planning for embodied tasks with language models. In Brian Williams, Yiling Chen, and Jennifer Neville, editors, AAAI Conference on Artificial Intelligence, pages 13192–13200, Washington, DC, USA, 2023. AAAI Press.

Harsha Nori, Nicholas King, Scott Mayer McKinney, Dean Carignan, and Eric Horvitz. Capabilities of gpt-4 on medical challenge problems. arXiv preprint arXiv:2303.13375, 2023.

OpenAI. GPT-4 technical report. CoRR, abs/2303.08774, 2023. doi:10.48550/arXiv.2303.08774. URL  https://doi.org/10.48550/arXiv.2303.08774.

Zheng Lin Qingyi Si. Alpaca-cot: An instruction fine-tuning platform with instruction data collection and unified large language models interface. https://github.com/PhoebusSi/alpaca-CoT, 2023.

Dhruv Shah, Blazej Osinski, Brian Ichter, and Sergey Levine. Lm-nav: Robotic navigation with large pre-trained models of language, vision, and action. In Karen Liu, Dana Kulic, and Jeffrey Ichnowski, editors, Conference on Robot Learning, CoRL 2022, 14-18 December 2022, Auckland, New Zealand, volume 205 of Proceedings of Machine Learning Research, pages 492–504. PMLR, 2022. URL https://proceedings.mlr.press/v205/shah23b.html.

Sheng Shen, Le Hou, Yanqi Zhou, Nan Du, Shayne Longpre, Jason Wei, Hyung Won Chung, Barret Zoph, William Fedus, Xinyun Chen, Tu Vu, Yuexin Wu, Wuyang Chen, Albert Webson, Yunxuan Li, Vincent Zhao, Hongkun Yu, Kurt Keutzer, Trevor Darrell, and Denny Zhou. Flan-moe: Scaling instruction-finetuned language models with sparse mixture of experts. CoRR, abs/2305.14705, 2023. doi:10.48550/arXiv.2305.14705.  URL https://doi.org/10.48550/arXiv.2305.14705.

Hoo-Chang Shin, Yang Zhang, Evelina Bakhturina, Raul Puri, Mostofa Patwary, Mohammad Shoeybi, and Raghav Mani. Biomegatron: Larger biomedical domain language model. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pages 4700–4706. Association for Computational Linguistics, 2020. doi:10.18653/v1/2020.emnlp-main.379. URL https://doi.org/10.18653/v1/2020.emnlp-main.379.

Ishika Singh, Valts Blukis, Arsalan Mousavian, Ankit Goyal, Danfei Xu, Jonathan Tremblay, Dieter Fox, Jesse Thomason, and Animesh Garg. Progprompt: Generating situated robot task plans using large language models. In IEEE International Conference on Robotics and Automation, ICRA 2023, London, UK, May 29 – June 2, 2023, pages 11523–11530. IEEE, 2023.

Chan Hee Song, Jiaman Wu, Clayton Washington, Brian M. Sadler, Wei-Lun Chao, and Yu Su. Llm-planner: Few-shot grounded planning for embodied agents with large language models. CoRR, abs/2212.04088, 2022. doi:10.48550/arXiv.2212.04088. URL https://doi.org/10.48550/arXiv.2212.04088.

Arun James Thirunavukarasu, Darren Shu Jeng Ting, Kabilan Elangovan, Laura Gutierrez, Ting Fang Tan, and Daniel Shu Wei Ting. Large language models in medicine. Nature Medicine, pages 1–11, 2023.

Karthik Valmeekam, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. Large language models still can’t plan (a benchmark for llms on planning and reasoning about change). arXiv preprint arXiv:2206.10498, 2022.

Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David S. Rosenberg, and Gideon Mann. Bloomberggpt: A large language model for finance. CoRR, abs/2303.17564, 2023. doi:10.48550/arXiv.2303.17564. URL https://doi.org/10.48550/arXiv.2303.17564.

Yunhao Yang, Jean-Raphaël Gaglione, Cyrus Neary, and Ufuk Topcu. Automaton-based representations of task knowledge from generative language models. CoRR, abs/2212.01944, 2022. doi:10.48550/arXiv.2212.01944. URL https://doi.org/10.48550/arXiv.2212.01944.

A         Appendix

A.1       Planning Domain Examples

Listing 4: Define a problem and a set of actions for the chess game in PDDL.