Sep 12, 2023
Nunzio Lorè Network Science Institute
Multi-Agent Intelligent Complex Systems (MAGICS) Lab Northeastern University, Boston, Massachusetts, USA email@example.com
College of Engineering and Network Science Institute Multi-Agent Intelligent Complex Systems (MAGICS) Lab Northeastern University, Boston, Massachusetts, USA firstname.lastname@example.org
This paper investigates the strategic decision-making capabilities of three Large Language Models (LLMs): GPT-3.5, GPT-4, and LLaMa-2, within the framework of game theory. Utilizing four canonical two-player games—Prisoner’s Dilemma, Stag Hunt, Snowdrift, and Prisoner’s Delight—we explore how these models navigate social dilemmas, situations where players can either cooperate for a collective benefit or defect for individual gain. Crucially, we extend our analysis to examine the role of contextual framing, such as diplomatic relations or casual friendships, in shaping the models’ decisions. Our findings reveal a complex landscape: whileGPT-3.5ishighlysensitivetocontextualframing,itshowslimited ability to engage in abstract strategic reasoning. Both GPT-4 and LLaMa-2 adjust their strategies based on game structure and context, but LLaMa-2 exhibits a more nuanced understanding of the games’ underlying mechanics. These results highlight the current limitations and varied proficiencies of LLMs in strategic decision-making,cautioningagainsttheirunqualifieduseintasksrequiringcomplexstrategic reasoning.
Large Language Models (LLMs) such as GPT from OpenAI and LLaMa-2 from Meta have garnered significant attention for their ability to perform a range of human-like tasks that extend far beyond simple conversation. Some argue that these models may serve as an intermediate step toward Artificial General Intelligence (AGI) . Recent advancements have shown GPT-4 passing the bar exam  and GPT-3 solving complex mathematical problems . Despite these achievements, these models exhibit limitations, notably in tasks like network structure recognition .
Social and behavioral science research on Large Language Models (LLMs), including GPT and LLaMa-2, is divided into two principal streams: one that explores human-like cognitive capabilities such as reasoning and theory of mind [5, 6, 7, 8, 9], and another that evaluates performance in comparison to human skills across a variety of tasks [10, 11, 12]. In the field of economics, the emphasis is predominantly on performance evaluation, exploring applications like market research and sentiment analysis [13, 14, 15]. This dual focus coalesces in social science research, where LLMs have gained attention for their potential to simulate human behavior in experimental settings [16, 17, 18, 19]. Notably, within the intricate framework of social dilemmas and game theory, LLMs are being tested for both their cognitive reasoning skills and performance outcomes [20, 21, 22, 23].
Existing studies indicate that LLMs can mimic human behavior to some extent [22, 21], yet their aptitude for strategic decision-making in game-theoretic contexts is still an area for exploration. Beyond the structural elements of a game, the contextual framing can significantly affect decision-making processes. Prior research on human behavior has underlined the powerful role of context in shaping strategic choices; for example, the framing of a game as a Wall Street venture versus a community endeavor led to divergent decisions . As a result, our study aims to go beyond assessing the fundamental strategic capabilities of LLMs, also considering the influence of game structure and contextual framing on their decision-making.
To disentangle the complexities of strategic decision-making in LLMs, we conduct a series of game-theoretic simulations on three distinct models: GPT-3.5, GPT-4, and LLaMa-2. We focus on social dilemmas, games in which players may either cooperate for collective benefit or defect for individual gain. Starting from the well-known Prisoner’s Dilemma, we expand our study to include other two-player games such as the Stag Hunt, Snowdrift, and Prisoner’s Delight (aka Harmony Game). Besides examining these games, we introduce five different contexts—ranging from business and diplomatic discussions to casual interactions between friends—to evaluate how contextual framing influences strategic choices. Our primary research question is to determine the relative significance of game structure versus contextual framing in shaping the behavior of these models.
Our findings unveil the subtle intricacies in how each of the examined Large Language Models responds to strategic scenarios. GPT-3.5 appears particularly sensitive to contextual framing but demonstrates limited proficiency in grasping abstract strategic considerations, such as reasoning based on a best response strategy. In contrast, both GPT-4 and LLaMa-2 exhibit a more balanced approach, adjusting their strategies based on both the intrinsic game structure and the contextual framing. Notably, the impact of context is more pronounced in specific domains, such as interactions framed as games among friends, where the game structure itself takes a backseat.
When it comes to comparing GPT-4 and LLaMa-2, our findings reveal that GPT-4, on average, places greater weight on the game structure than on context, relative to LLaMa-2. However, prioritizing game structure over context does not translate to a nuanced differentiation between distinct game types. In fact, GPT-4 seems to employ a binary threshold approach, categorizing games into ’high’ and ’low’ social dilemma buckets, rather than discerning the unique features of each game. Contrary to this, LLaMa-2 exhibits a more finely-grained understanding of the various game structures, even though it places greater emphasis on contextual factors compared to GPT-4. This suggests that LLaMa-2 is better equipped to navigate the subtleties of different strategic scenarios while also incorporating context into its decision-making, whereas GPT-4 adopts a more generalized, structure-centric strategy.
In addition to analyzing the decision-making patterns of these large language models, we examined anecdotal evidence to further decipher the mechanisms behind their distinct behaviors. GPT-3.5 appears to have a rudimentary understanding of strategic scenarios, frequently failing to identify best responses and committing a variety of basic mathematical errors. GPT-4, on the other hand, demonstrates a higher level of sophistication in its arguments. It often begins its reasoning by modeling the game structure and conditioning its responses based on anticipated actions of other players. However, GPT-4 also tends to mischaracterize game structures, often reducing them to variations of the Prisoner’s Dilemma, even when the structural nuances suggest otherwise. Interestingly, it adopts a different line of reasoning in games framed between friends, emphasizing the importance of longer-term relationships over immediate payoff maximization—despite explicit game descriptions to the contrary. LLaMa-2 approaches these strategic scenarios differently, initially abstracting the problem to a higher level using explicit game-theoretic language. It then layers contextual elements on top of this game-theoretic foundation, offering a well-rounded analysis that encompasses both game structure and situational factors.
Figure 1 shows the schematic workflow of this research and the process through which we generate our results. To each game we combine a context, a term we use to indicate the social environment in which the interaction described by the model takes place. We run 300 initializations per LLM for each of the 20 possible unique combinations of context and game, before aggregating the results in order to conduct our statistical analysis.
Figure 1: A schematic explanation of our data collecting process. A combination of a contextual prompt and a game prompt is fed into one of the LLM we examine in this paper, namely GPT-3.5, GPT-4, and LLaMa-2. Each combination creates a unique scenario, and for each scenario we collect 300initializations. Thedataforallscenariosplayedbyeachalgorithmisthenaggregatedandusedfor our statistical analysis, while the motivations provided are scrutinized in our Reasoning Exploration section.
We run our experiments using OpenAI’s gpt-3.5-turbo-16k and gpt-4 models, interfacing with them through Python’s openai package. For LLaMa-2, we utilize Northeastern University’s High Performance Cluster (HPC) as the model lacks a dedicated API or user interface. We access LLaMa-2 via the HuggingFace pipeline. To standardize our simulations, we restrict the response token count to 50 for the OpenAI models and 8 for LLaMa-2, setting the temperature parameter at 0.8. We opt for this temperature setting for several reasons: first, it mirrors the default settings in user-based applications like chatGPT, providing a realistic baseline; second, it allows for the exploration of multiple plausible actions in games with mixed Nash equilibria; and third, lower temperature settings risk obscuring the inherently probabilistic nature of these algorithms and may produce unengaging results. We note that high temperatures are commonly used in related working papers [25, 26].
Our experimental design includes two distinct prompts for each LLM. The initial prompt sets the context, outlining the environment and directing the algorithm to assume a specific role. Its aim is to create a realistic setting for the game to take place. The second prompt establishes the “rules,” or more accurately, the payoff structure of the game. While contextual prompts are disseminated via the system role, the payoff prompts are communicated through the user role. In both scenarios, we adhere to best practices such as advising the model to deliberate thoughtfully and utilizing longer prompts for clarity [25, 26]. The contextual prompts are crafted to be universally applicable to the range of games examined, sacrificing some degree of specificity for broader relevance. Detailed text for each prompt is available in Appendix A. Summarizing, we present the following scenarios:
• A summit between two heads of state from two different countries (“IR”), • A meeting between two CEOS from two different firms (“biz”),
• A conference between two industry leaders belonging to two different companies making a joint commitment on environmental regulations (“environment”),
• A talk between two employees who belong to the same team but are competing for a promotion (“team”),
• A chat between two friends trying to reach a compromise (“friendsharing”).
The games we use for our analysis are borrowed from the literature on social dilemmas in game theory. In particular, they all have the following form:
In this paper, we define “social dilemmas” any strategic interaction models that feature two types of actions: a socially optimal action that benefits both players if chosen mutually, and an individually optimal action that advantages one player at the expense of the other. We refer to the socially optimal action as “cooperation,” abbreviated as “C,” and the individually optimal action as “defection,” also abbreviated as “D.” For clarity, each pair of actions taken by players corresponds to a payoff vector, which we express in terms of utils or points, following standard game theory conventions. The first entry in the vector represents the row player’s payoff, while the second entry is reserved for the column player. In this framework, “R” signifies the reward for mutual cooperation, “T” represents temptation to defect when the other player cooperates, “S” indicates the sucker’s payoff for cooperating against a defector, and “P” stands for the punishment both players receive when both choose to defect, typically leading to a suboptimal outcome for both. Different relationships between these values give rise to different games:
• When T > R > P > S, the game is the Prisoner’s Dilemma;
• When T > R > S > P, the game is Snowdrift, also known as Chicken; • When R > T > P > S, the game is Stag Hunt;
• When R > T > S > P, the game is the Prisoner’s Delight, also known as Harmony.
This structure is in the spirit of  and , in which the same four game theoretic models are used to capture different types and degrees of social dilemma. We point out that Prisoner’s Delight is not exactly a dilemma, but rather an anti-dilemma, as choosing to cooperate is both socially and individually optimal. On the opposite end of the spectrum lies the Prisoner’s Dilemma, in which defecting is always optimal and thus leads to a situation in which both players are worse off, at least according to standard predictions in Game Theory.
Here we introduce a piece of important terminology: in the Prisoner’s Dilemma and in the Prisoner’s Delight, only one action is justifiable. This means that one action strictly dominates another, and therefore a rational player would only ever play the strictly dominant action. The Stag Hunt and Snowdrift lie somewhere in between, with both cooperation and defection being justifiable. More specifically, in the Stag Hunt, the Nash Equilibrium in pure actions is reached if both players coordinate on the same action (with the cooperative equilibrium being payoff dominant), whereas in Snowdrift said equilibrium is reached if both players coordinate on opposite actions. As neither action strictly dominates the other, a rational player is justified in playing either or both, and in fact for these games an equilibrium exists in mixed strategies as well.
For each game and for each context, we run 300 initializations and record the action taken by the LLM agent, and keep track of the rate of cooperation by the LLM agents for our follow up analysis. For each experiment, we keep the prompts constant across LLMs.
Figure 2 displays an overview of our results for all three LLMs. To better clarify the role of game structure vs. framing context, results are aggregated at different levels: we group the observations at the game level on the left at the context level on the right, and each row represents a different LLM. A few things appear immediately clear when visually inspecting the figure. First, GPT-3.5 tends not to cooperate regardless of game or context. Second, GPT-4’s choice of actions is almost perfectly
Figure 2: Summary of our findings, displayed using bar charts and outcomes grouped either by game or by context. On the y axis we display the average propensity to cooperate in a given game and under a given context, with standard error bars. Figures (a) and (b) refer to our experiments using GPT-3.5, and anticipate one of our key findings: context matters more than game in determining the choice of action for this algorithm. Figures (c) and (d) instead show how the opposite is true for GPT-4: almost all contexts are more or less playing the same strategy, that of cooperating in two of the four games and defecting in the remaining two. Finally, Figures (e) and (f) present our results for LLaMa-2, whose choice of action clearly depends both on context and on the structure of the game.
To further corroborate and substantiate our findings, we turn to dominance analysis using STAT. In practice, dominance analysis is used to study how the prediction error changes when a given independent variable is omitted from a statistical model. This procedure generates 2x − 1 nested models, with x being the number of regressors. The larger the increase on average over the nested models in error, the greater the importance of the predictor. . We run a logit regression for each LLM encoding each game and each context as a dummy variable, and then we use dominance analysis to identify which dummies have the largest impact on the dependant variable. The output is presented in Table 1. We notice that “friendsharing” consistently ranks in the top spots across all algorithms, and indeed by analyzing Figure 2 it appears immediately clear that this context is consistently associated with higher rates of cooperation regardless of game or LLM. For GPT-3.5, contexts represent the five most important variables, with games with a sole rationalizable action occupying positions 6 and 7. This suggests that GPT-3.5 might have a tendency to put weight on context first and on game structure last, with a slight bias for “simpler” games. For GPT-4, on the other hand, the ranking is almost perfectly inverted with games being the regressors with the highest dominance score. Prisoner’s Delight and Dilemma once again rank the highest among games for influence, while “friendsharing” is dethroned and relegated to the second position. The ranking for LLaMa-2 paints a more nuanced picture, with contexts and games alternating throughout the ranking, but with “friendsharing” still firmly establishing itself as the most influential variable.
While these rankings are in and of themselves informative, we are also interested in assessing whether contexts or games in aggregate are more important for a given LLM. We take the average of the importance score for each group (contexts and game) and plot that in Figure 3. By observing the graph, we can conclude that for GPT-3.5 context matters more on average, while the opposite is true for GPT-4. Moreover, LLaMa-2 is also more interested in games than in contexts, but not to the same extent as GPT-4. Having concluded this preliminary analysis, we take a closer look at how LLMs play different games across different contexts, and how their choice of action differs from game-theoretic equilibria. We point out that in the case of Stag Hunt and Snowdrift we use equilibria in mixed actions as our meter of comparison, but for both games playing any pure strategy could potentially constitute an equilibrium. Even so, we expect that a rational algorithm that randomizes between options would err towards the equilibrium mixture of these actions, and thus we include it as a general benchmark.
Figure 3: Average importance of context variables vs. game variable for each LLM. Results follow from the dominance analysis of table 1
Of the three LLMs we examine, GPT-3.5 is the least advanced and the most available to the general public, since the free version of chatGPT runs on 3.5. As seen in Figure 2, GPT-3.5 has a remarkable tendency to defect, even when doing so is not justifiable. Choosing to play an unjustifiable action is per se a symptom of non-strategic behavior, which coupled with a general aversion to cooperation might even indicate spiteful preferences. In game theory, players exhibit spiteful preferences when they gain utility from the losses incurred by their coplayer, or alternatively, when their utility gain is inversely proportional to the utility gain of their coplayers. This seems to be the case of the Prisoner’s Delight, in which for all contexts GPT-3.5 opts to defect significantly. Conversely, it is true that GPT-3.5 cooperates more than at equilibrium when playing the Prisoner’s Dilemma, and for some contexts its choices are strikingly prosocial when playing Snowdrift or Stag hunt. More to the point, it appears that the responses of GPT-3.5 depend on the context of the prompt. In a context in which the interaction is said to occur between a pair of friends, GPT-3.5 is more prone to cooperate than in scenarios in which competition is either overtly accounted for or implied. In order to gain a quantitative understanding of this variance in behavior, we conduct a difference in proportions Z-test between different contexts, including the game-theoretic equilibrium as a baseline. This is because GPT-3.5 is a probabilistic model, and thus its actions are a consequence of a sampling from a distribution. As such, we are interested in measuring how this distribution differs from equilibrium and from other samplings occurring under different contexts. The result of our analysis is displayed in Figure 4. We compare the proportion of initializations in which GPT-3.5 has chosen to defect in a given context against the same quantity either in another context or at equilibrium, and assess whether the difference is statistically significant from zero. It bears pointing out that differences from equilibrium are not the sole argument against the rationality or sophistication of GPT-3.5. In fact, the difference in strategies among different contexts when playing the same game is already an indicator that the LLM is susceptible to framing effects. Indeed, we observe that “friendsharing” and “IR” consistently choose more cooperation than other contexts, although not always at a statistically significant level. The opposite is true for “biz” and “environment,” with “team” falling somewhere in the middle but closer to this latter group. Notably, all contexts play Snowdrift and Stag Hunt at levels close or equal to equilibrium, with small but statistically significant differences. Here and elsewhere in the paper we observe that Stag Hunt induces more cooperation than Snowdrift,adiscomfortingfactinthelightofSnowdrift’soriginsasamodelfornuclearbrinkmanship.
Compared to its predecessor, GPT-4 performs a lot better in terms of both strategic behavior and cooperation. For instance, when playing Prisoner’s Delight under any context, the LLM will always choose to cooperate, which is the sole justifiable action. Nevertheless, context dependence is still very strong under “friendsharing” and the algorithm will always choose to cooperate regardless of the game. As for the other contexts, in broad strokes, they could be characterized as following two regimes: a cooperative one when playing Stag Hunt and Prisoner’s Delight, and a more hostile one when playing Snowdrift and the Prisoner’s Dilemma. This grouping indicates that, just like for GPT-3.5, GPT-4 behaves with more hostility when playing Snowdrift compared to when playing Stag Hunt, suggesting that the value of R holds substantial sway to the algorithm when an explicit maximization task is assigned to it. Looking at Figure 5, we observe that individual contexts do in fact play each game differently (with the exception of Prisoner’s Delight, which induces full cooperation). Of particular relevance is the fact that games with a sole justifiable action (namely Prisoner’s Dilemma and Prisoner’s Delight) are played very similarly between different contexts, with “friendsharing” and “environment” behaving significantly more cooperatively than the other context when playing Prisoner’s Dilemma. Snowdrift very closely mimics the results from the Prisoner’s Dilemma, albeit with significantly more variance in results. This pattern plays out identically when looking at the two remaining games, Stag Hunt and Prisoner’s Delight. The former is more varied in results and displays more propensity to defect, yet it closely tracks the results of Prisoner’s Delight. Looking at the results for all four games side-by-side, a more general pattern emerges of GPT-4 becoming more cooperative across all context as the value of R and of S increases. In other words, as cooperation becomes more rewarding, GPT-4 adjusts its preferences towards defecting less, as would be expected of a rational player.
As for LLaMa-2, it presents a very unique and interesting set of results. A brief glance at Figure 12 shows that, while “friendsharing” still induces the most cooperation, it is now joined by “environment” as the second most cooperative context. The other three contexts operate somewhat similarly and tend to be more prone to defection. Just like for GPT-4, games follow two regimes:
Figure 4: Difference-in-Proportion testing using Z-score for each game across contexts when using GPT-3.5. A negative number (in orange) represents a lower propensity to defect vs. a different context, and vice-versa for a positive number (in dark blue). One asterisk (*) corresponds to 5% significance in a two-tailed Z-score test, two asterisks (**) represent 1% significance, and three asterisks (***) 0.1% significance. Results are inverted and symmetric across the main diagonal, and thus entry (i,j) contains the inverse of entry (j,i)
Prisoner’s Dilemma and Snowdrift induce higher defection, whereas Stag Hunt and Prisoner’s Delight induce more cooperation. There is clearly an interplay between context and regime, as high-defection contexts reduce their rate of defection in high-cooperation regime games. Beyond the similarities with GPT-4, LLaMa-2 displays less defection in Snowdrift and less cooperation in Stag Hunt, which could potentially indicate that LLaMa-2 is more capable of strategic behavior. Indeed, playing a mix of the two strategies (even when that mix does not coincide with equilibrium) may mean that the algorithm recognizes the two strategies as justifiable and accordingly opts to play both. On the other hand, LLaMa-2 defects more often when playing Prisoner’s Delight and cooperates more often when playing Prisoner’s Dilemma, which instead points to the fact that this LLM might not fully grasp what makes an action justifiable. Prima facie, these results thus appear to lie somewhere in between GPT-3.5 and GPT-4.
Figure 5: Difference-in-Proportion testing using Z-score for each game across contexts using GPT-4. The methods employed are the same as those described in Figure 4
Results from Figure 6 show that while we have grouped contexts to be either more or less cooperative, they do, in fact, differ from each other within this broad-stroke generalization. For instance, “biz” defects more often than “IR” and “team” and this propensity is statistically significant when playing Snowdrift, Stag Hunt and Prisoner’s Delight. Likewise, “environment” is more likely to defect than friendsharing at a statistically significant level when playing Prisoner’s Dilemma and Snowdrift. Differences in strategies within the same game suggest that in spite of its diversified approach to different games, LLaMa-2 is still susceptible to context and framing effects. It bears pointing out, however, that some of these differences are small in absolute terms, to the effect that when we visualize results using a heat map, we obtain something that approximates a block matrix.
Having assessed how different LLMs play the same game under different contexts, we are now interested in running the opposite analysis instead, namely verifying how each context provided to an
Figure 6: Difference-in-Proportion testing using Z-score for each game across contexts using LLaMa-2. The methods employed are the same as those described in Figure 4
LLM influences its choice of strategy across different games. In the case of perfectly rational agents, we would expect them to play all four games differently regardless of context. Thus, just like in Figures 4 – 6, we conduct a battery of difference-in-proportions Z-test, this time across games and for each prompt.
Our results concerning GPT-3.5 (reported in Figure 7) were surprising but not entirely unexpected: for most scenarios, the game setting does not matter and only the prompt dictates a difference in strategies. This is most evident under the Team Talk prompt, which shows that no matter the game the difference in propensity to defect is not statistically different from 0. Under the “biz” prompt, GPT-3.5 defects less at a statistically significant level only when playing Prisoner’s Delight. In “friendsharing”, we observe a statistically significant decrease in the level of defections only in the Prisoner’s Delight and only with respect to Snowdrift and the Prisoner’s Dilemma. What’s more, these differences are at the knife edge of statistical significance. In the Environmental Negotiations scenario, the algorithm adopts two distinct regimes: a friendly one when playing Stag Hunt and Prisoner’s Delight, and a hostile one otherwise. Notice that these two regimes are not otherwise distinguishable from a statistical standpoint. The “IR” setting mimics this pattern, although at an overall lower level of significance. Overall, these observations help us better understand our results from Figure ??, in that they show just how little the structure of the game matters to GPT-3.5 when compared to context.
Figure 7: Difference-in-Proportions Z-score testing for each context across games using GPT-3.5. We use the same methods as in Figure 4, and the same classification for levels of statistical significance, but we do not compare the results to any equilibrium strategy. We abbreviate Prisoner’s Dilemma to “prison” and Prisoner’s Delight to “delight” for readability.
Figure 8 encloses our results for GPT-4. Immediately, we notice the persistence of a certain pattern. More specifically, across all contexts, there is a box-shaped pattern that consistently appears: Prisoner’s Dilemma and Snowdrift are very similar to one another, and very different from Prisoner’s Delight and Stag hunt (and vice-versa). Differences within the pairs exist for some contexts: “biz” and “IR” cooperate more when playing Prisoner’s Delight than when playing Stag Hunt, and “environment” cooperates more when playing Snowdrift than when playing the Prisoner’s Dilemma. These differences within pairs are more pronounced in “biz” and “environment” in a mirrored fashion: for games in which both cooperation and defection are justifiable, the former has a slight bias for defection, while the latter has a small bias for cooperation. The box-shaped pattern can be even observed (although weakly and without statistical significance) even when looking at the across-games comparison for “friendsharing”, and it is fully encapsulated in the results from Team Talk. Just like for GPT-3.5, through this analysis we gain a better appreciation for how much the game matters above and beyond context for GPT-4. Even so, a box-shaped pattern points at the fact that the algorithm might not be fully capable of telling games apart beyond a certain threshold, therefore exhibiting improved but still imperfect levels of rationality.
Figure 8: Difference-in-Proportions Z-score testing for each context across games when using GPT-4, using the same methods as in Figure 7.
On the contrary, When examining the results from Figure 9, we observe an heretofore unseen pattern in differences across games for each context. Earlier, we remarked that the results from LLaMa-2 appear to be in between GPT-3.5 and GPT-4. Our analysis in this section instead shows that they are quite unlike either. For instance, GPT-4 plays something closer to pure strategies in all games, whereas GPT-3.5 and LLaMa-2 both play mixed strategies when both actions are justifiable. However, unlike GPT-3.5, LLaMa-2 properly recognizes different game structures and adapts its strategy accordingly. In particular, “biz”, “team” and “IR” follow a different strategy for each game, behaving most cooperatively when playing Prisoner’s Delight and least cooperatively when playing the Prisoner’s Dilemma, with the other games occupying intermediate positions. This observation is in line with what could already be gauged from observing Figure 2, and shows that for most contexts, LLaMa-2 acts very strategically. More specifically, LLaMa-2 appears to be able to recognize the differences in the payoff structures and alter its choice of actions accordingly, although not necessarily always playing the equilibrium. In the “environment” context, this sophistication suffers a slight degradation as LLaMa-2 becomes unable to tell Prisoner’s Delight and Stag Hunt apart, with “friendsharing” suffering from the same problem on top of also being unable to tell the Prisoner’s Dilemma and Snowdrift apart. Summing up, while the results for the dominance analysis clearly indicate that LLaMa-2 is more context-driven than GPT-4, it seems that unlike the latter, the former is more capable of telling different game structures apart and adapting it strategy accordingly.
Figure 9: Difference-in-Proportions Z-score testing for each context across games when using LLaMa-2, using the same methods as in Figure 7.
Making a final assessment on the rationality of these algorithms from a game-theoretic perspective is no easy task. For GPT-3.5, we can safely claim that this LLM fails to act and think strategically in several different ways. Moreover, as already remarked, GPT-3.5 plays the same game differently when given a different contextual prompt, but does not play different games differently when given the same contextual prompt. This shows that the framing effect from the context is a more important factor for the algorithm’s final decision compared compared to the extant structure of incentives, unlike what happens for its successor GPT-4. Indeed, for this large language model, the game itself plays a larger role in guiding the behavior of GPT-4. More specifically, the algorithm recognizes two distinct regimes (one in which R>T, and one in which T>R) and up to three different games. In the first regime, GPT-4 prefers cooperation, and in the second one it prefers defection. These overall preferences are mediated by the context supplied, but they are never fully erased or supplanted, not even under “friendsharing”, the strongest context in terms of shaping the behavior of the algorithm. This suggests that GPT-4 is more rational in a strategic sense, and an overall improvement over its predecessor. Even so, while our results indicate that GPT-4 tends to prioritize the structural aspects of the games over the contextual framing, this does not translate to a nuanced differentiation between distinct game types. In fact, GPT-4 seems to employ a binary threshold approach, categorizing games into ’high’ and ’low’ social dilemma buckets, rather than discerning the unique features of each game. Contrary to this, LLaMa-2 exhibits a more finely-grained understanding of the various game structures, even though it places greater emphasis on contextual factors compared to GPT-4. This suggests that LLaMa-2 is better equipped to navigate the subtleties of different strategic scenarios while also incorporating context into its decision-making, whereas GPT-4 adopts a more generalized, structure-centric strategy. The intricacies and idiosyncrasies of these algorithms make it difficult to give a final verdict on whether GPT-4 or LLaMa-2 is superior in terms of strategic thinking, and therefore we rather point out that both are flawed in different ways.
Over the course of this paper, we have investigated the capability of Large Language Models to act strategically using classic examples of social dilemmas from Game Theory. In particular, we have assessed how the context provided when presenting a model of interaction shapes and guides decision. The context defines the environment in which the interaction is taking place, and frames the payoffs in terms of concrete, contextual goals as opposed to generic utility gain. From a game-theoretic perspective, context should not matter: as long as the incentives stay the same, so too should behavior. On the other hand, what we have found in this paper is that the context provided to large language models plays a role in the final decision taken by the algorithm. More in particular, GPT-3.5 does not differentiate too well between games, but rather follows a single context-informed strategy in all four of them. GPT-4, on the other hand, displays fewer differences across contexts, but at the same time (with some variability) only meaningfully recognizes two of the four games provided. LLaMa-2 exhibits yet another mode of behavior, which is more capable of telling different games apart than GPT-4 but is at the same time more susceptible and affected by context. In our querying of different LLMs, we always instruct each algorithm not to answer us with an explanation of their reasoning but rather just their choice of action. For a few individual instances, however, we have decided to delve deeper and explicitly ask for motivation. We do so in order to catch a glimpse of what the processes underlying each decision are, and while we cannot offer a comprehensive review of each one of them, we have nevertheless obtained some informative anecdotes from our experiments. First, when asking GPT-3.5 to explicitly motivate its choices, we observe that its reasoning is faulty and flawed in that it fails to carry out simple mathematical comparisons and to account for coplayer action. In the following example, we present evidence of GPT-3.5’s difficulties in assessing which of two numbers is larger when playing the Prisoner’s Delight under the “biz” context:
Next, we provide GPT-3.5 the “biz” context and the Snowdrift game, and ask to motivate its choice of strategy. We observe that on top of the mathematical mistakes it made before, it now seems unable to take into account coplayer’s reasoning and actions:
We run the same informal check by looking at the motivations that GPT-4 gives for its actions. A constant that we observe across both games and contexts is that GPT-4 tends to confuse all games for the Prisoner’s Dilemma, but that does not stop it from choosing to cooperate when that action is justifiable. For example, this is how it motivates its choice to cooperate when playing Stag Hunt under the “biz” context:
Notably, action C is not merely chosen because it is justifiable, but also because GPT-4 envisions that an equally clever opponent would realize the implicit incentives that exist to coordinate on the most rewarding action. Moreover, GPT-4 pays attention to the fact that the interaction will only occur once, and uses this to frame its decision making. The following is an example when the algorithm plays the Prisoner’s Dilemma under the “friendsharing” context:
In other words, GPT-4 recognizes that not only it cannot build reputation, but also that it cannot gain it back. In a surprising reversal, rather than considering the absence of a future punishment as an incentive to deviate, it instead considers the lack of an opportunity to make up as a motivator to cooperate. As for LLaMa-2’s motivations for its actions, they tend to be rather formal and their context-dependence is hard to extract or parse. For instance, when asked to explain its thought process behind its choice of action when the game is the Prisoner’s Dilemma and the context is “friendsharing”, its response is:
Even though this is just an individual example, most of LLaMa-2’s replies tend to follow this pattern and emphasize the search for a best response rather than openly citing the circumstances surrounding the interaction as a motivator. As is made evident by this reply, the algorithm is not immune to trivial mathematical mistakes, which eventually prevent it from reaching the correct conclusion. This is also the case when playing Prisoner’s Delight under the “biz” contextual framing:
While LLaMa-2 prefers to pick C when playing Prisoner’s Delight (irrespective of context), when it does pick D it will still try to reason as if looking for an unconditional best response.
Overall, this informal inquiry into the motivations given by large language models for their choices of action substantially affirms the result of our earlier quantitative analysis. GPT-3.5 confirms itself as incapable of strategic behavior, sometimes to the effect that its preferences become spiteful. Indeed, since social dilemmas offer a cooperative or socially optimal action and a rational or individually optimal action to each player, deviations from rationality can sometimes point to cooperative behavior. In our study of Prisoner’s Delight, however, we have seen GPT-3.5 frequently fail to choose the “double optimum” (i.e. the action that is both socially and individually optimal), pointing to the fact that the algorithm is unsophisticated at best and spiteful at worst.
GPT-4, on the other hand, is more strategic in the choices it makes and responds more strongly to incentives: it will pick the individually optimal action when it stands to gain more from it, and it will pick the socially optimal actions when it would be more rewarding to do so. Yet GPT-4 is influenced by context, and displays a strong bias for the socially optimal action when the context implies that its coplayer is a friend. Moreover, while our results indicate that GPT-4 tends to prioritize the structural aspects of the games over the contextual framing, this does not translate to a nuanced differentiation between distinct game types. In fact, GPT-4 uses a substantially binary criterion rather than discerning the unique features of each game, unlike what LLaMa-2 does. Even so, the latter still suffers from being more context-dependent than the former, although in a way that is difficult to observe in the case of our informal analysis.
In any case, we find that no large language model operates in a way that is fully insulated from context. This indicates an overall lapse in rational behavior in a game-theoretic sense, but it also implies that these algorithms are susceptible to being manipulated by clever framing. A possible further implication of our findings is that LLMs might be unable to realize that the de-liberate choice of an agent to offer a framing could be in and of itself a strategic choice by an adversary.
While our results suggest that Large Language models are unfit for strategic interaction, they represent just some preliminary findings in a field of study we anticipate will be rich and large. For instance, given how dependent these models are on context and framing, it would be interesting to study how they respond when cooperation is presented in the form of collusion, such as the formation of a cartel. Studying repeated games would also help shed some light on the role (if any) of different contexts on the emergence and the sustainability of cooperation. Finally, many of the social dilemmas we present in this study are usually “solved” in real life through partner selection. Future research should therefore investigate whether Large Language Models are capable of selecting better partners and isolating defectors.
 Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023.
 Daniel Martin Katz, Michael James Bommarito, Shang Gao, and Pablo Arredondo. Gpt-4 passes the bar exam. Available at SSRN 4389233, 2023.
 Mingyu Zong and Bhaskar Krishnamachari. Solving math word problems concerning systems of equations with gpt-3. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 15972–15979, 2023.
 Jiayan Guo, Lun Du, and Hengyu Liu. Gpt4graph: Can large language models understand graph structured data? an empirical evaluation and benchmarking. arXiv preprint arXiv:2305.15066, 2023.
 Konstantine Arkoudas. Gpt-4 can’t reason. arXiv preprint arXiv:2308.03762, 2023.  Chris Frith and Uta Frith. Theory of mind. Current biology, 15(17):R644–R645, 2005.
 Manmeet Singh, Vaisakh SB, Neetiraj Malviya, et al. Mind meets machine: Unravelling gpt-4’s cognitive psychology. arXiv preprint arXiv:2303.11436, 2023.
 Thilo Hagendorff and Sarah Fabi. Human-like intuitive behavior and reasoning biases emerged in language models–and disappeared in gpt-4. arXiv preprint arXiv:2306.07622, 2023.
 Hanmeng Liu, Ruoxi Ning, Zhiyang Teng, Jian Liu, Qiji Zhou, and Yue Zhang. Evaluating the logical reasoning ability of chatgpt and gpt-4. arXiv preprint arXiv:2304.03439, 2023.
 Rohaid Ali, Oliver Young Tang, Ian David Connolly, Patricia L Zadnik Sullivan, John H Shin, Jared S Fridley, Wael F Asaad, Deus Cielo, Adetokunbo A Oyelese, Curtis E Doberstein, et al. Performance of chatgpt and gpt-4 on neurosurgery written board examinations. medRxiv, pages 2023–03, 2023.
 John C Lin, David N Younessi, Sai S Kurapati, Oliver Y Tang, and Ingrid U Scott. Comparison ofgpt-3.5,gpt-4, andhumanuserperformanceonapracticeophthalmologywrittenexamination. Eye, pages 1–2, 2023.
 Joost CF de Winter. Can chatgpt pass high school exams on english language comprehension. Researchgate. Preprint, 2023.
 James Brand, Ayelet Israeli, and Donald Ngwe. Using gpt for market research. Available at SSRN 4395751, 2023.
 Aref Mahdavi Ardekani, Julie Bertz, Michael MDowling, and Suwan Cheng Long. Econsent gpt: A universal economic sentiment engine? Available at SSRN, 2023.
 Yiting Chen, Tracy Xiao Liu, You Shan, and Songfa Zhong. The emergence of economic rationality of gpt. arXiv preprint arXiv:2305.12763, 2023.
 Gati Aher, Rosa I Arriaga, and Adam Tauman Kalai. Using large language models to simulate multiple humans. arXiv preprint arXiv:2208.10264, 2022.
 John J Horton. Large language models as simulated economic agents: What can we learn from homo silicus? Technical report, National Bureau of Economic Research, 2023.
 Thilo Hagendorff. Machine psychology: Investigating emergent capabilities and behavior in large language models using psychological methods. arXiv preprint arXiv:2303.13988, 2023.
 Lisa P Argyle, Ethan C Busby, Nancy Fulda, Joshua R Gubler, Christopher Rytting, and David Wingate. Out of one, many: Using language models to simulate human samples. Political Analysis, 31(3):337–351, 2023.
 Steve Phelps and Yvan I Russell. Investigating emergent goal-like behaviour in large language models using experimental economics. arXiv preprint arXiv:2305.07970, 2023. Fulin Guo. Gpt agents in game theory experiments. arXiv preprint arXiv:2305.05516, 2023.
 Philip Brookins and Jason Matthew DeBacker. Playing games with gpt: What can we learn about a large language model from canonical strategic games? Available at SSRN 4493398, 2023.
 Elif Akata, Lion Schulz, Julian Coda-Forno, Seong Joon Oh, Matthias Bethge, and Eric Schulz. Playing repeated games with large language models. arXiv preprint arXiv:2305.16867, 2023.
 Varda Liberman, Steven M Samuels, and Lee Ross. The name of the game: Predictive power of reputationsversussituationallabelsindeterminingprisoner’sdilemmagamemoves. Personality and social psychology bulletin, 30(9):1175–1185, 2004.
 Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213, 2022.
 Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
 David A Gianetto and Babak Heydari. Catalysts of cooperation in system of systems: The role of diversity and network structure. IEEE Systems Journal, 9(1):303–311, 2013.
 David A Gianetto and Babak Heydari. Network modularity is essential for evolution of cooperation under uncertainty. Scientific reports, 5(1):9340, 2015.
 Joseph N. Luchman. Determining relative importance in stata using dominance analysis: domin and domme. The Stata Journal, 21(2):510–538, 2021.
Appendix A: Prompts
Meeting between CEOS, or “biz”:
Negotiations over Environmental Regulation, or “environment”:
Chat between friends, or “friendsharing”:
Talk between teammates, or “team”:
Summit between international leaders, or “IR”:
Appendix B: Additional Figures
Figure 10: Bar chart visualization of the propensity to defect or cooperate for each context and for each game using GPT-3.5. In red, the percentage of times the algorithm chose to defect. The dark red striped bar indicates equilibrium values. in the Prisoner’s Delight, a rational player would never defect, and thus no bar is displayed. For Stag Hunt and Snowdrift, we indicate as “equilibrium” the probabilities an equilibrium mixed strategy would assign to either action, but both games possess multiple equilibria in pure strategies.
Figure 11: Stacked bar chart visualization of the propensity to defect for each context and for each game using GPT-4. The methods employed are the same as those described in Figure 10
Figure 12: Bar chart visualization of the propensity to defect for each context and for each game using LLaMa-2. The methods employed are the same as those described in Figure 10