Instigating Cooperation among LLM Agents Using Adaptive Information Modulation

Qiliang Chen
Northeastern University
Boston, MA 02115, USA
[email protected]
Sepehr Ilami
Northeastern University
Boston, MA 02115, USA
[email protected]
Nunzio Lore
Northeastern University
Boston, MA 02115, USA
[email protected]
Babak Heydari*
Northeastern University
Boston, MA 02115, USA
[email protected]

Abstract

This paper introduces a novel framework combining LLM agents as proxies for human strategic behavior with reinforcement learning (RL) to engage these agents in evolving strategic interactions within team environments. Our approach extends traditional agent-based simulations by using strategic LLM agents (SLA) and introducing dynamic and adaptive governance through a pro-social promoting RL agent (PPA) that modulates information access across agents in a network, optimizing social welfare and promoting pro-social behavior. Through validation in iterative games, including the prisoner’s dilemma, we demonstrate that SLA agents exhibit nuanced strategic adaptations. The PPA agent effectively learns to adjust information transparency, resulting in enhanced cooperation rates. This framework offers significant insights into AI-mediated social dynamics, contributing to the deployment of AI in real-world team settings.

1 Introduction

Interactions within social and sociotechnical systems are frequently characterized by a delicate balance of cooperation and competition, often leading to complex social dilemmas [1, 2, 3, 4, 5]. The challenge of governing these systems effectively hinges on the ability to foster and sustain prosocial behavior, thereby enhancing both system-level efficiency and fairness. The importance of such governance frameworks is underscored by their potential to increase overall social welfare, making the study of prosocial behavior crucial in a variety of contexts [6, 7, 8, 9].

Historically, numerous disciplines have explored mechanisms for promoting prosocial behavior. However, these efforts have largely resulted in static, deterministic recommendations, which fail to account for the dynamic nature of agent interactions and learning processes [10, 11, 12, 13]. Consequently, traditional governance methods have often been limited to highly stylized or context-specific heuristics that do not easily generalize across different settings.

Reinforcement learning (RL) has emerged as a promising tool for developing dynamic governance frameworks aimed at encouraging prosocial behavior in the face of social dilemmas [14, 15, 16]. Despite its theoretical potential, the practical application of RL in this domain has been constrained by the significant costs and time associated with training RL agents using extensive human behavior data. While agent-based simulations have provided a stylized environment for modeling the evolution of prosocial behavior [17, 18, 19], the reliance on assumptions about agent behavior in strategic scenarios presents significant limitations. These behaviors are influenced by a complex interplay of game-theoretic structures, contextual factors, and the bounded rationality of agents, leading to outcomes that often deviate from theoretical predictions.

In this paper, we propose that the advent of advanced large language models (LLMs) offers a transformative opportunity to develop more robust governance mechanisms for sociotechnical systems. Recent studies have demonstrated that LLMs are capable of capturing nuanced strategic decision-making behaviors in classic games such as the Prisoner’s Dilemma, Snowdrift, and Stag Hunt. Importantly, these behaviors are influenced not only by the payoff structure of the game but also by the contextual framing, suggesting that LLM agents could serve as more accurate proxies for human behavior [20]. By leveraging the computational power of traditional agent-based models (ABMs) alongside LLMs, we can facilitate RL training without the need for overly simplistic assumptions, thus resulting in more efficient dynamic governance schemes. Our proposed framework is designed to address two key objectives: creating dynamic governance mechanisms for human interactions using LLM agents as proxies and enhancing prosocial behavior and alignment in hybrid environments with both LLM and human agents.

Figure 1: Overview of the general framework. The framework includes two main entities: the Strategic LLM Agent (SLA) and the Pro-social Promoting Agent (PPA). 1) SLAs receive prompts that describe pairwise strategic games (payoff matrix, objectives, and additional information from the PPA) and then make strategic decisions like cooperation or defection. Multiple SLAs are placed in a random network, with connections initialized each round. SLAs may make different decisions in different interactions based on varying information received. Prompts are refined through micro-level validation for consistent behavior. 2) The PPA acts as a system manager, observing SLAs and dynamically determining their information levels, trained via reinforcement learning to maximize social welfare. Results are presented to evaluate the framework.

Our approach encompasses both modeling and governance dimensions. On the modeling front, we extend the current literature on LLMs by exploring their behavior in repeated games, allowing for the evolution of agent strategies based on detailed historical interactions within the system. On the governance front, we develop an RL-based governing agent tasked with dynamically intervening in the system to promote prosocial behavior. Specifically, our method involves dynamically adjusting the level of information access available to each LLM agent regarding the strategic behaviors of others. This approach to governance—modulating information access—offers two significant advantages: it preserves the decentralized nature of the system by maintaining agent autonomy, and avoids the need for costly and often impractical changes to game payoffs.

Our findings provide insights on both modeling and governance levels: For modeling, using what we refer to as micro-level validation, we show that LLM agents are capable of capturing nuanced strategic behavior, demonstrating significant and reasonable behavioral adaptations in response to changes in information access. Furthermore, we show that the reinforcement learning agent effectively learns to dynamically adjust information access, resulting in an increase in cooperation rates compared to different static baseline interventions without RL. Overall, although a simple and small implementation of the proposed framework, this work contributes to the expanding field of AI-mediated social dynamics, offering valuable insights into the deployment of AI in complex, real-world team settings.

Despite being an early-stage implementation with limited scale and complexity, this work already demonstrates how integrating interactive LLM agents with reinforcement learning can pave the way for new forms of dynamic governance, even when applied to the extreme social dilemma of the Prisoner’s Dilemma. Our framework offers a promising foundation for AI-mediated social dynamics, showcasing the potential for leveraging LLMs in complex, real-world settings to manage and influence prosocial behavior by using AI to dynamically modulate the visibility and recall of human and LLM agents.

2 Related Work

Mechanisms to Promote Cooperation under Social Dilemma

A significant body of literature has examined mechanisms that influence cooperation in strategic interactions involving social dilemmas. Reputation and reciprocity are well-established factors, with interventions that establish and reinforce them consistently promoting cooperative behavior [21, 22]. Punishment and reward systems have also proven effective, deterring defection while incentivizing prosocial actions [23, 24]. Additionally, the structure of interaction networks plays a crucial role, either by identifying network characteristics that support cooperation [25, 26] or by examining how changes in the network structure over time can influence cooperation [27, 28].

While these factors are effective in fostering cooperation, many require extended periods of interaction to yield results (e.g., promoting cooperation through the emergence of new norms or conventions) or rely on interventions that contradict agents’ autonomy (e.g., top-down approaches that alter network structure [14, 29]). Another influential factor is the extent to which agents can observe others’ behaviors and recall historical information [30, 31]. In this work, we focus on this factor as our intervention mechanism: the RL agent dynamically modulates the level of information available to LLM agents, specifically through adjustments in observation and recall, in order to increase the overall cooperation rate.

LLMs and Strategic Decision Making

Recent studies reveal that large language models (LLMs) are capable of handling basic economic and game-theoretic scenarios [32, 33, 34, 35], yet their decision-making processes are often unclear and they seem to struggle with belief refinement [36]. When these models are evaluated as substitutes for human agents, the results frequently diverge from the predictions of both rational choice theory and behavioral economics, raising questions about their cognitive fidelity [37, 38, 39, 40]. This has sparked a debate on the most appropriate methods for evaluating these models [41, 42], particularly in terms of their alignment with human-like reasoning. Nonetheless, there is growing optimism about their potential to simulate human thought and behavior, as ongoing advancements continue to enhance their ability to replicate complex cognitive processes [43, 44, 45, 40, 46].

LLMs and Multi-Agent Systems

Research on LLM-empowered multi-agent systems can be broadly categorized into two primary domains: simulations and implementations. Implementations refer to algorithms or platforms that leverage the interactive capabilities of LLMs to generate concrete, end-to-end solutions or finished products [47, 48, 49]. These systems focus on harnessing the synergistic interactions among multiple LLMs to achieve specific, often practical, outcomes. In contrast, simulations aim to explore and demonstrate the potential of LLM-powered agents to emulate human-like behavior, interactions, and social dynamics across a variety of controlled environments [50, 51, 52, 53, 36]. These simulations provide a sandbox for investigating the emergent properties of LLMs in multi-agent contexts, shedding light on their ability to replicate complex human dynamics. Our research aligns more closely with the simulation paradigm, offering a detailed examination of how LLMs can capture human-like interactions. However, the insights gained from this work also carry significant implications for the development of implementation frameworks, suggesting pathways for translating simulated behaviors into practical applications.

3 Methods

3.1 Overview of the general framework

Figure 1 illustrates the overall framework used in this paper. We begin by developing interactive strategic LLM agents (SLAs) and a dynamic prompt structure, which is then employed by a reinforcement learning (RL) agent to encourage prosocial behavior in social dilemma games, such as the prisoner’s dilemma (see the SI document, Section A, for more details). A key step in this process is what we call micro-level validation, where we design the prompting template to ensure that agents’ cooperative behavior shifts appropriately in response to governance signals intended for use by the RL agent. The SLAs are then placed as nodes of a network, where pairs of neighbors engage in a multi-period strategic game, here the prisoner’s dilemma (PD). In each period, they can choose to cooperate or defect in interactions with their network neighbors, with each SLA capable of selecting different actions for different neighbors at any given time.

In each period, the RL agent sends a vector of signals to the network (one per SLA), aiming to maximize the discounted sum of scores for all agents, which, in the case of the PD game, requires a high rate of cooperation across the network. To preserve the autonomy of the SLAs, the steering signals only modify the level of information each SLA has about the past cooperative behavior of other agents, with variations in aggregation levels and conditions (e.g., average of past mutual history). To assess whether RL intervention has increased the rate of cooperation and overall average pay-off, we compare these trends against a set of benchmarks. We now go over the key parts of the framework.
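The objective described above can be written out as a short calculation. The sketch below is illustrative only: the function name and the example numbers are assumptions, and the default discount factor simply mirrors the value of 0.99 reported in the environment settings.

```python
from typing import Sequence


def discounted_welfare(welfare_per_period: Sequence[float], gamma: float = 0.99) -> float:
    """Return the sum over t of gamma**t * W_t, with W_t the summed SLA scores in period t."""
    return sum(((gamma ** t) * w for t, w in enumerate(welfare_per_period)), 0.0)


# Three periods with total scores 10, 20, 30 and gamma = 0.5:
# 10 + 0.5 * 20 + 0.25 * 30 = 27.5
example = discounted_welfare([10, 20, 30], gamma=0.5)
```

With a discount factor close to 1, later periods count almost as much as early ones, so the objective favors signals that build lasting cooperation over those that yield a quick one-off gain.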

3.2 Strategic LLM Agents (SLA) and Micro-level Validation

The Strategic LLM Agent (SLA) is central to our framework, designed either as a digital proxy for human agents in complex multi-agent scenarios or for task-specific multi-agent LLMs. We restrict SLA interactions to pairwise social dilemma games, such as the prisoner’s dilemma. In every period, SLAs receive messages describing the nature of their pairwise strategic games, conveyed solely through the payoff matrix and objectives, omitting explicit references to game names (e.g., Prisoner’s Dilemma). Additionally, SLAs can access various information provided by the Pro-social Promoting Agent (PPA), such as the cooperation rates of their co-players, their neighborhood, or the entire network. This information, combined with the structure of the strategic game, generates prompts that guide SLAs in making strategic decisions, namely whether to cooperate or defect. The prompts and information sets are refined through micro-level validation to ensure consistent and reasonable SLA behavior.

The SLAs are positioned in a randomly structured network, which is initialized at the start of each round and remains fixed throughout that round. Initially, SLAs are homogeneous in type but heterogeneous in their network positions. Each SLA engages in pairwise strategic games with all directly connected co-players at each time step. An SLA may participate in multiple interactions simultaneously, potentially making different decisions with different co-players. As interactions progress, SLAs can evolve along different trajectories, leading to heterogeneity in both agent types and network positions.
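To make the per-neighbor decision structure concrete, here is a minimal sketch of one time step. It assumes a hypothetical `decide(agent, co_player)` callable that stands in for the LLM call; the names `play_step` and `toy_policy` are illustrative and do not come from the paper's code.

```python
from typing import Callable, Dict, List, Tuple

Edge = Tuple[int, int]
# decide(agent, co_player) -> "C" or "D"; a hypothetical stand-in for the LLM call.
Policy = Callable[[int, int], str]


def play_step(edges: List[Edge], decide: Policy) -> Dict[Edge, Tuple[str, str]]:
    """Play one time step: each connected pair plays once, each side choosing per co-player."""
    outcomes = {}
    for i, j in edges:
        outcomes[(i, j)] = (decide(i, j), decide(j, i))
    return outcomes


def toy_policy(agent: int, co_player: int) -> str:
    """Agent 0 defects against agent 2 only; every other decision is to cooperate."""
    return "D" if (agent, co_player) == (0, 2) else "C"


step = play_step([(0, 1), (0, 2)], toy_policy)
# Agent 0 cooperates with agent 1 but defects against agent 2 in the same step.
```

Keying decisions by the ordered pair (agent, co-player) is what lets a single SLA take different actions in simultaneous interactions.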

Micro-level Validation: While LLMs have shown impressive performance across a wide range of interactive decision-making tasks, research has raised concerns that LLMs may not fully grasp the tasks they perform, sometimes resulting in erratic behavior [40, 54]. Given these concerns, we need to first validate their strategic behavior at a micro level—through individual and pairwise interactions. Our micro-level validation has three objectives: first, to assess whether LLMs can comprehend strategic setups through utility matrices alone without explicit reference to the game’s name; second, to evaluate how different types of information provided to LLMs influence their behavior, thereby testing the feasibility of governing the system by manipulating information through the PPA; and finally, to determine whether altering the information provided results in reasonable qualitative changes in LLM decision-making behavior. Our approach aligns more with internal validation [55], focusing on consistent and predictable behavioral adaptation within the LLM agents themselves, rather than external validation, which compares SLA behavior to that of human agents—although the distinction between the two has become increasingly blurred for LLM-based models.

To achieve these objectives, we systematically adjusted prompts with different levels of information access to evaluate LLM responses, ensuring that SLA behavior shifts appropriately. Detailed results of these experiments are provided in subsequent sections (see the results section, also SI document, Section D, for more information). These findings informed the refinement of the prompts used to describe the prisoner’s dilemma and the definition of LLM decision-making objectives.

3.3 Pro-social Promoting Agent (PPA)

The pro-social promoting agent (PPA) serves as the governing entity within our framework and is trained using reinforcement learning. Its role is to enhance network-level social welfare (the sum of all SLA scores) by dynamically adjusting the information levels provided to each agent.

The PPA is a standard Actor-Critic RL agent, whose environment is modeled as a Partially Observable Markov Decision Process (POMDP) (see the SI document, Section C, for more details). The PPA strategically tailors the information disclosed to each SLA, providing different signals to different SLAs during each interaction period. The action set of the RL agent includes four tiers of information access: (1) no information about past interactions; (2) the last action pair between the SLA and its co-players; (3) the last action pair combined with the long-term cooperation ratios of both the SLA and its co-players; and (4) the last action pair along with the long-term cooperation ratios of the SLA and all neighboring agents. The scenario of having no information is often unrealistic, as agents are generally expected to recall the history of their last interaction with other players. Therefore, in our experiments, we limit the use of this action. The RL agent is rewarded based on the discounted sum of the social welfare of the network (the sum of all SLA pay-offs), which, in the case of the Prisoner’s Dilemma game, is expected to be highly correlated with the overall rate of cooperation in the network.
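The four-tier action set can be encoded as a small enumeration. The sketch below is an assumption-laden illustration: the class and function names are hypothetical, and the rule that the no-information tier is available only in the first period follows the restriction described for the experiments in Section 4.3.

```python
from enum import IntEnum
from typing import List


class InfoTier(IntEnum):
    NONE = 0         # (1) no information about past interactions
    LAST_ACTION = 1  # (2) last action pair with the co-player
    LA_AR = 2        # (3) last action pair + long-term ratios of the SLA and its co-players
    LA_NR = 3        # (4) last action pair + long-term ratios of the SLA and all neighbors


def allowed_tiers(period: int) -> List[InfoTier]:
    """Restrict NONE to the first period of a round, when no history exists yet."""
    if period == 0:
        return [InfoTier.NONE]
    return [InfoTier.LAST_ACTION, InfoTier.LA_AR, InfoTier.LA_NR]


def is_valid_signal_vector(signals: List[InfoTier], period: int) -> bool:
    """A signal vector holds one tier per SLA; each entry must be allowed in this period."""
    permitted = allowed_tiers(period)
    return all(s in permitted for s in signals)
```

For example, `is_valid_signal_vector([InfoTier.LAST_ACTION, InfoTier.LA_NR], period=3)` holds, whereas a vector containing `InfoTier.NONE` is rejected after the first period.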

4 Experiment results and analysis

In this section, we begin by describing the settings of our experimental environment. We will then discuss the outcomes of micro-level validation for LLM agents, including how we crafted the prompts for subsequent experiments. Lastly, we will present the results of system governance conducted by the RL manager across multiple LLM agents and analyze the evolution of the environment throughout the experiments.

Table 1: Micro-level validation of Last Action. This table shows the ratio of choosing C for LLM agents over 100 runs when observing different last action pairs.

Own Action	Coplayer Action	Ratio of choosing C
C	C	100.0%
C	D	0.0%
D	C	0.0%
D	D	49.0%

4.1 Environment settings

We outline the key features and settings of the environment used to evaluate the proposed framework. The strategic interactions among agents are simulated using the prisoner’s dilemma, a choice we make to test our framework under a high level of social dilemma. The game’s payoff matrix is as follows: mutual cooperation (CC) yields 3 points for each player, unilateral cooperation with defection by the other player (CD or DC) results in 0 points for the cooperator and 5 points for the defector, and mutual defection (DD) gives 1 point to each player.
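The payoff matrix above is small enough to check directly. The following sketch encodes it as a lookup table and verifies the two standard prisoner's dilemma conditions; the variable and function names are illustrative assumptions.

```python
from typing import Dict, Tuple

# (own action, co-player action) -> (own payoff, co-player payoff), as stated in the text.
PAYOFFS: Dict[Tuple[str, str], Tuple[int, int]] = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def pair_welfare(a: str, b: str) -> int:
    """Social welfare of one interaction: the sum of both players' payoffs."""
    return sum(PAYOFFS[(a, b)])


# Temptation (5) > reward (3) > punishment (1) > sucker's payoff (0).
T, R, P, S = 5, 3, 1, 0
is_dilemma = T > R > P > S
# Mutual cooperation beats alternating exploitation: 2 * 3 = 6 > 5 + 0 = 5.
cooperation_is_efficient = 2 * R > T + S
```

Per-interaction welfare is 6 for mutual cooperation, 5 for unilateral defection, and 2 for mutual defection, which is why maximizing social welfare in this game pushes toward cooperation.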

The system comprises 20 agents situated within a network, which is initially structured using an Erdős–Rényi model with a 0.25 probability of forming each link and is kept fixed within each round, although we evaluate the method over different instances of such networks. Each round of the experiment on each network consists of 20 time steps, during which agents sequentially engage in the prisoner’s dilemma with the adjacent co-players in their neighborhood. Due to the relatively small network size, the behavior of SLAs tends to stabilize within 20 time steps for the majority of the experiments. We used the LLaMa3-70b LLM, accessed via the LangChain and Groq platforms. The model’s temperature was set to 0.8. The neural networks employed for the actor and critic components each contain one hidden layer with 256 neurons. The learning rates were configured at 0.001 for the actor and 0.005 for the critic, with a discount factor of 0.99. Both training and evaluation were performed on the High-Performance Cluster (HPC) at [Anonymous] University.
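A network of this kind can be sampled with the standard library alone. The sketch below is one possible way to do it, not necessarily how the authors generated their networks; the function name and the seed argument are assumptions.

```python
import random
from itertools import combinations
from typing import List, Optional, Tuple


def erdos_renyi_edges(n: int = 20, p: float = 0.25,
                      seed: Optional[int] = None) -> List[Tuple[int, int]]:
    """Include each of the n*(n-1)/2 possible undirected links independently with probability p."""
    rng = random.Random(seed)
    return [(i, j) for i, j in combinations(range(n), 2) if rng.random() < p]


# One network instance; it stays fixed for the 20 time steps of a round.
edges = erdos_renyi_edges(seed=0)
# With n = 20 there are 190 possible links, so the expected number is 190 * 0.25 = 47.5.
```

Because links are sampled independently, some instances may contain isolated agents; whether such instances were filtered out is not stated in the text.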

4.2 Results from micro-level validation of LLM agent

LLM agents possess the capability to make decisions that are akin to human decisions across various tasks. To ensure that the experimental results in our project are both reasonable and meaningful, we conducted several micro-level validations of LLM agent behavior to address the questions outlined in the previous section. Based on these findings, we crafted prompts incorporating different types of information for subsequent experiments.

First, we design prompts that describe the prisoner’s dilemma and set objectives for the LLM agents to guide their decisions. While it remains uncertain whether LLMs fully understand different games, we avoid explicitly mentioning "prisoner’s dilemma" to minimize bias in the LLM’s behavior. Instead, we present the payoff matrix without labeling the game. For the objective, we instruct the SLA to maximize its rewards while noting that it may interact with the same co-player multiple times. This is intended to encourage SLAs to take actions that balance short-term and long-term payoffs, much like how humans make strategic decisions based partly on the likelihood of future interactions. Additionally, we incorporate chain-of-thought (CoT) prompting [56] to encourage more strategic reasoning. By default, SLA agents always have access to the most recent action pairs from their interactions with co-players. Figure 2 shows the prompts used when SLA agents have access to the latest action pairs.

Table 2: Micro-level validation of Network Ratio. This table shows the percentage of runs (out of 100) in which LLM agents chose C when observing different last action pairs together with the historical cooperation rates of themselves and their neighborhood. Rarely, Sometimes, and Often denote cooperation rates below 33%, between 33% and 66%, and above 66%, respectively.

	Rarely			Sometimes			Often
	Rarely	Sometimes	Often	Rarely	Sometimes	Often	Rarely	Sometimes	Often
[C, C]	0.0	87.0	99.0	0.0	100.0	100.0	0.0	100.0	100.0
[C, D]	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	3.0
[D, C]	0.0	0.0	0.0	0.0	0.0	9.0	0.0	0.0	98.0
[D, D]	0.0	1.0	68.0	0.0	28.9	98.0	0.0	57.9	100.0

At the initial step of the game, there is no history of last action pairs, so we first delete the red part of the prompt in Figure 2. In addition, we add "Consider the proposed scenario and act as if you were taking part in it." at the beginning of the prompt to form the ’no prior information’ prompt. Through extensive testing and modification of the prompt phrasing, we observed that this change significantly shifts the LLM’s behavior from consistently choosing D to choosing C at a 50% rate. To simulate more human-like behavior, we expected the LLM to exhibit some likelihood of choosing C in the ’no prior information’ scenario, reflecting how some individuals might initially seek to build trust. Therefore, we adopted this enhanced prompt for the initial ’no prior information’ condition and reserved the original prompt as the default for other scenarios.

After finalizing the structure of our prompts, we aimed to explore how different types of information influence the LLM’s behavior and to assess whether these influences are both reasonable and consistent. We designed three distinct types of information: 1) Last action history (LA), which encompasses the immediate past interactions between an agent and their co-player, detailing the actions from their last game. 2) Cooperation ratio of agent and opponent (AR), reflecting the overall cooperation rate of each agent with every other agent across all previous interactions. 3) Cooperation ratio of both agent and agent’s neighbors (NR), which includes the agent’s own cooperation ratio from all previous interactions and the historical cooperation ratio of their neighboring agents. When building the prompts for the different information types, we add the corresponding information to the red part in Figure 2. Placeholders in the prompts, such as "{your_action}" or "{neighbor_ratio}", are filled with the corresponding values during the simulation (see the SI document, Section B, for more details).
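The placeholder mechanism can be illustrated with ordinary string formatting. In the sketch below, only the placeholder names `{your_action}` and `{neighbor_ratio}` are quoted from the text; the template wording, the other placeholder names, and the function name are hypothetical and do not reproduce the actual prompts.

```python
from typing import Optional

# Hypothetical wording; the real prompt text is given in the paper's Figure 2 and SI.
LA_TEMPLATE = (
    "In your last game with this co-player, you chose {your_action} "
    "and they chose {coplayer_action}."
)
NR_TEMPLATE = (
    "Historically, you cooperate {your_ratio} "
    "and your neighbors cooperate {neighbor_ratio}."
)


def build_info_block(your_action: str, coplayer_action: str,
                     your_ratio: Optional[str] = None,
                     neighbor_ratio: Optional[str] = None) -> str:
    """Fill the LA sentence, and append the NR sentence only when both ratios are supplied."""
    parts = [LA_TEMPLATE.format(your_action=your_action, coplayer_action=coplayer_action)]
    if your_ratio is not None and neighbor_ratio is not None:
        parts.append(NR_TEMPLATE.format(your_ratio=your_ratio, neighbor_ratio=neighbor_ratio))
    return " ".join(parts)


la_only = build_info_block("C", "D")
la_plus_nr = build_info_block("C", "D", your_ratio="Sometimes", neighbor_ratio="Often")
```

Composing the information block from optional sentences keeps the game description fixed while only the governed part of the prompt varies.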

Our experiments showed that using numeric values for the cooperation ratios of agents and their neighbors led to unreliable results, as LLM agents were overly sensitive to thresholds like a 70% cooperation rate, which completely dictated their cooperative behavior. To overcome this, we transitioned to qualitative categories for cooperation: we used Rarely (when cooperation is below 33%), Sometimes (33% to 66%), and Often (above 66%), although this can be extended to higher levels of granularity. We conducted several micro-level validations across different scenarios. Here, we present two key results; more comprehensive validations across additional scenarios are reported in the SI document, Section D.
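The mapping from a numeric cooperation ratio to a qualitative label is a simple threshold function. The sketch below follows the cut-offs stated above; the function name is an assumption, and assigning the exact boundary values 33% and 66% to Sometimes is one reading of the stated ranges.

```python
def cooperation_label(ratio: float) -> str:
    """Map a cooperation ratio in [0, 1] to Rarely / Sometimes / Often."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie in [0, 1]")
    if ratio < 0.33:
        return "Rarely"      # below 33%
    if ratio <= 0.66:
        return "Sometimes"   # 33% to 66%
    return "Often"           # above 66%


labels = [cooperation_label(r) for r in (0.10, 0.50, 0.70)]
```

Coarse labels hide small numeric differences from the LLM, which is the stated reason for preferring them over raw percentages.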

Table 1 shows the ratio of LLM agents choosing C when observing different action pairs in the latest interaction with their co-players. The results consistently demonstrate a pattern of reciprocity: LLM agents reciprocate cooperation (100%) when both they and their opponents cooperate. Conversely, if an LLM agent cooperates and the opponent defects, or vice versa, the LLM agent responds with defection (100%). However, in cases where both the LLM agent and the opponent defect, the response is more variable, with cooperation occurring about half the time (49%). These findings indicate that while LLM agents strongly reciprocate cooperation, their response to mutual defection is less predictable, reflecting a more nuanced behavioral strategy. While 100% cooperation is unrealistic, the overall variation across different scenarios is reasonable and provides sufficient scope to test the impact of the PPA agent.
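The frequencies in Table 1 can also be read as a simple stochastic response rule. The sketch below samples an action from those frequencies; it is a stand-in for exercising simulation code without calling an LLM and is not part of the authors' method. The names `P_COOPERATE` and `sample_action` are assumptions.

```python
import random
from typing import Dict, Tuple

# Probability of choosing C given (own last action, co-player's last action), from Table 1.
P_COOPERATE: Dict[Tuple[str, str], float] = {
    ("C", "C"): 1.00,
    ("C", "D"): 0.00,
    ("D", "C"): 0.00,
    ("D", "D"): 0.49,
}


def sample_action(last_pair: Tuple[str, str], rng: random.Random) -> str:
    """Draw C with the empirical frequency observed for this last action pair."""
    return "C" if rng.random() < P_COOPERATE[last_pair] else "D"


rng = random.Random(0)
after_mutual_cooperation = [sample_action(("C", "C"), rng) for _ in range(5)]
after_being_exploited = [sample_action(("C", "D"), rng) for _ in range(5)]
```

Only the mutual-defection case is genuinely random under this rule, which matches the observation that the response to [D, D] is the least predictable.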

Table 2 presents the ratio of LLM agents choosing C when observing not only different action pairs in the latest interaction with their co-players but also the historical cooperation rates of themselves and their neighborhood. First, we observe that the LLM agent’s behavior is shaped by both the history of interactions and neighborhood information at the system level. For example, as shown in Table 1, LLM agents consistently choose C (cooperate) when the last action pairs observed are [C, C]. Conversely, if both the agents and their neighborhood "rarely" cooperate and this information is disclosed, they will switch to choosing D (defect) with 100% certainty. This change demonstrates that LLM agents adjust the weighting of their attention on the information observed, leading to corresponding shifts in their behaviors.

Second, the results reveal a consistent pattern: the more cooperative behavior exhibited by an agent’s opponents and neighbors, the more likely the agent is to cooperate. Specifically, when both the opponent and neighbors display a history of cooperation (i.e., [C, C]), LLM agents typically cooperate with high frequency (87-100%). However, even in scenarios with [C, C], the LLM’s cooperation rate can drop significantly if the overall network cooperation ratio is low (e.g., "Rarely"). These findings indicate that LLM agents adapt their responses based on the current prompts and available information reasonably, laying a reliable foundation for subsequent experiments involving LLM agent behavior.

Table 3: The performance of different scenarios in the network averaged over 20 steps and at the final step. The first two columns show the average percentage of cooperation over time and the average percentage of cooperation at the end of the round. The next two columns demonstrate the average social welfare (SW) over time and the average social welfare (SW) at the end of the round, which are normalized by the number of interactions. All of the results are averaged across 10 rounds.

Scenario	Avg. C Rate (%)	Final C Rate (%)	Avg. SW	Final SW
RL	88	100	5.59	6.00
LA+NR	76	88	5.16	5.58
LA	62	80	4.64	5.28
LA+AR	11	4	2.51	2.16

Moreover, Table 2 reveals an interesting dynamic in the LLM’s behavior. As the network’s history becomes more cooperative, the LLM’s cooperation rate increases, but with diminishing returns. For example, when the network’s history changes from "Rarely" to "Sometimes" cooperative, the LLM’s cooperation rate jumps significantly (e.g., from 0% to 87% in the [C, C] scenario). However, further increases in the network’s cooperation rate lead to smaller increases in the LLM’s cooperation rate. This suggests that the LLM is most sensitive to initial changes in the social context and becomes less responsive to additional changes once a certain threshold of cooperation is reached.

4.3 PPA Effect on System Performance and Cooperation Rate

In the previous section, we designed prompts to integrate game structure and information access. Building on this, SLAs are then constructed, and the PPA learns to modulate information access for each SLA, aiming to optimize the social welfare of all agents. It is important to note that, in many cases, the PPA can only modulate or expose information from more distant interactions, while the history of recent interactions between an SLA and its co-player is expected to be retained in most applications. Thus, we used the no-information action only for the first period within each round, when SLAs have no prior interactions with each other. We established several baselines for comparison, each providing one fixed type of information throughout a round. We conducted ten rounds for each method (more rounds are expected to be added), and the comparison of results can be found in Table 3.
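The two quantities reported in Table 3 can be computed from the action pairs of a single time step as follows. This is a sketch under stated assumptions: the cooperation rate is taken to be the share of individual actions that are C, social welfare is divided by the number of pairwise interactions, and the function name is hypothetical.

```python
from typing import Dict, List, Tuple

PAYOFFS: Dict[Tuple[str, str], Tuple[int, int]] = {
    ("C", "C"): (3, 3), ("C", "D"): (0, 5), ("D", "C"): (5, 0), ("D", "D"): (1, 1),
}


def step_metrics(outcomes: List[Tuple[str, str]]) -> Tuple[float, float]:
    """Return (cooperation rate, social welfare per interaction) for one time step."""
    n = len(outcomes)
    if n == 0:
        raise ValueError("at least one interaction is required")
    coop_rate = sum(pair.count("C") for pair in outcomes) / (2 * n)
    welfare = sum(sum(PAYOFFS[pair]) for pair in outcomes) / n
    return coop_rate, welfare


all_cooperate = step_metrics([("C", "C")] * 4)   # (1.0, 6.0)
mixed = step_metrics([("C", "C"), ("D", "D")])   # (0.5, 4.0)
```

Under this normalization, full cooperation gives a welfare of 6.00 per interaction, which agrees with the final-step RL row of Table 3 (100% cooperation, 6.00).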

Here we present the normalized social welfare and the system’s cooperation rate, averaged across all rounds, with each round involving a different random network instance. Additionally, we report both the time-averaged welfare and cooperation rate across rounds, as well as the outcomes from the final step of the game. We can make the following observations:

First, we observe some expected yet noteworthy results from the non-PPA baseline interventions. When each SLA is only reminded of its most recent interaction with other agents, we see a moderate level of cooperation, consistent with research suggesting that strategies based on recent interaction history, like the well-known tit-for-tat strategy [17], can foster cooperative behavior. Extending this by consistently providing each SLA with the average cooperation rate of its neighbors further increases the cooperation rate, aligning with findings that cooperation is better promoted within tightly connected clusters of agents [57]. However, providing the overall cooperation rates of individual neighbors yields the lowest social welfare and cooperation levels. This aligns with previous findings that extending memory of past behaviors can diminish cooperation, as it makes agents less forgiving and more prone to defect if a co-player has a history of lower cooperation.

While these consistent signaling schemes already show considerable variation in performance and cooperation outcomes, our results demonstrate that the PPA agent, which dynamically selects among these options, outperforms all baselines across all four metrics. Although the primary objective of the RL manager is to maximize social welfare, its interventions also significantly enhance the system’s cooperation rate, aligning with expectations for the PD game. A 100% cooperation rate remains unrealistic, as noted earlier; however, our findings suggest that RL can effectively learn and adaptively deliver targeted information, boosting both social welfare and cooperation rates in complex systems.

These findings can also be observed in the left part of Figure 3, which shows the average cooperation rate and its standard deviation over 20 periods. The RL manager reaches a higher cooperation rate more rapidly than all baseline methods, corroborating our earlier analysis. The figure also reveals that all methods except LA+AR lead to an increase in the system’s cooperation rate over time. The evolution of social welfare over time follows similar trends and is reported in Section D of the SI document.

Finally, we examine how the behavior of SLAs evolves over time under the different methods. To do this, we track the ratios of the four action pairs in SLAs’ interactions ([C, C], [C, D], [D, C], and [D, D]) as they change over time. The results are presented in the right panel of Figure 3. First, the ratio of [C, C] actions for the RL method increases rapidly, reaching 100% by the 10th timestep.

Similar trends are observed for the LA and LA+NR methods, although they increase at a slower pace and do not converge to 100%. For the AN method, we observe the opposite trend: the ratio of [D, D] actions increases sharply while the others decrease. This supports our previous claims that the AR method tends to steer the system toward defection. For the RL, LA, and LA+NR methods, a common trend is observed in which the ratio of [D, D] initially increases, peaks in the early steps, and then decreases. This pattern arises because LLM agents, initially devoid of memory, choose randomly between C (cooperate) and D (defect) under the ‘no prior information’ prompts. This randomness often produces [C, D] or [D, C] pairs, which can escalate into a predominance of [D, D] outcomes as agents try to avoid being exploited by their co-players while also seeking an advantage for themselves.
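The action-pair analysis described above amounts to counting ordered pairs per timestep and normalizing. A minimal sketch, assuming each timestep is given as a list of (action, co-player action) tuples; the names are illustrative only:

```python
from collections import Counter

PAIRS = [("C", "C"), ("C", "D"), ("D", "C"), ("D", "D")]

def pair_ratios(interactions_by_step):
    """For each timestep, return the share of each ordered action pair.

    interactions_by_step: list of timesteps; each timestep is a list of
    (action_of_first_agent, action_of_second_agent) tuples.
    """
    ratios = []
    for pairs in interactions_by_step:
        counts = Counter(pairs)
        total = len(pairs)
        # Report all four pairs, including those that did not occur.
        ratios.append({p: (counts[p] / total if total else 0.0) for p in PAIRS})
    return ratios
```

Plotting each pair's ratio against the timestep index then gives one line per pair, in the style of the right panel of Figure 3.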

5 Conclusion

In this paper, we propose a framework comprising multiple Strategic LLM Agents (SLAs) positioned in a random network, interacting with their neighbors, and a Pro-social Promoting Agent (PPA) that dynamically provides information to SLAs to foster pro-social behavior and maximize social welfare. Each SLA receives prompts, including descriptions of pairwise strategic games, objectives, and additional information from the PPA, and uses them to choose between actions such as cooperation and defection. The information set and prompts are refined through micro-level validation to ensure that SLAs’ behaviors are consistent and reasonable. The PPA, trained via reinforcement learning, observes relevant information from both SLAs in each interaction and determines the level of information to provide, aiming to maximize social welfare. The evaluation results show that the RL-trained PPA outperforms the baseline methods on all four evaluation metrics. Furthermore, the analysis of the PPA’s learned behavior generates meaningful insights.
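The idea of learning which information level to provide can be illustrated with a deliberately simplified stand-in: an epsilon-greedy bandit over information levels. This is not the PPA described in the paper, which conditions on observations of both SLAs and is trained with reinforcement learning; here the state is ignored entirely, and the class name, level labels, and welfare values are all hypothetical.

```python
import random

class InfoLevelBandit:
    """Epsilon-greedy choice among information levels, rewarded by observed welfare."""

    def __init__(self, levels, epsilon=0.1, seed=0):
        self.levels = list(levels)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = {level: 0 for level in self.levels}
        self.values = {level: 0.0 for level in self.levels}  # running mean welfare

    def choose(self):
        # Explore with probability epsilon, otherwise pick the best estimate so far.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.levels)
        return max(self.levels, key=lambda level: self.values[level])

    def update(self, level, welfare):
        # Incremental update of the mean welfare observed for this level.
        self.counts[level] += 1
        self.values[level] += (welfare - self.values[level]) / self.counts[level]
```

In use, one would call `choose()` before each interaction, pass the selected information to the two SLAs, and feed the resulting joint payoff back through `update()`. A state-dependent policy, as in the paper, would replace the per-level running means with a learned function of the observed history.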

This work has a few limitations. Because of constraints on computational resources (see the SI for details), the sample size of the evaluation (number of rounds) is limited, which may introduce fluctuations in the observed trends. However, the confidence intervals in our results indicate that increasing the number of rounds is unlikely to significantly alter the key findings. A potential way to ease this constraint is to use small-scale LLMs fine-tuned by large models, which can replicate certain behaviors of the larger models while being much more computationally efficient. Additionally, future work could explore network structures beyond random networks, test the framework on other strategic games, and incorporate more granular information tiers for PPA intervention.

References

[1] Andrew M Colman. The puzzle of cooperation, 2006.
[2] Bruce Kogut and Udo Zander. What firms do? coordination, identity, and learning. Organization science, 7(5):502–518, 1996.
[3] Ernst Fehr and Herbert Gintis. Human motivation and social cooperation: Experimental and analytical foundations. Annu. Rev. Sociol., 33(1):43–64, 2007.
[4] Max Kleiman-Weiner, Mark K Ho, Joseph L Austerweil, Michael L Littman, and Joshua B Tenenbaum. Coordinate to cooperate or compete: abstract goals and joint intentions in social interaction. In CogSci, 2016.
[5] Werner Hoffmann, Dovev Lavie, Jeffrey J Reuer, and Andrew Shipilov. The interplay of competition and cooperation. Strategic Management Journal, 39(12):3033–3052, 2018.
[6] Jörg Gross, Zsombor Z Méder, Carsten KW De Dreu, Angelo Romano, Welmer E Molenmaker, and Laura C Hoenig. The evolution of universal cooperation. Science Advances, 9(7):eadd8289, 2023.
[7] Ernst Fehr and Ivo Schurtenberger. Normative foundations of human cooperation. Nature Human Behaviour, 2(7):458–468, 2018.
[8] Jörg Gross, Sonja Veistola, Carsten KW De Dreu, and Eric Van Dijk. Self-reliance crowds out group cooperation and increases wealth inequality. Nature Communications, 11(1):5161, 2020.
[9] David G Rand and Martin A Nowak. Human cooperation. Trends in cognitive sciences, 17(8):413–425, 2013.
[10] Elinor Ostrom. Governing the commons: The evolution of institutions for collective action. Cambridge university press, 1990.
[11] Martin A Nowak. Five rules for the evolution of cooperation. Science, 314(5805):1560–1563, 2006.
[12] Hisashi Ohtsuki, Christoph Hauert, Erez Lieberman, and Martin A Nowak. A simple rule for the evolution of cooperation on graphs and social networks. Nature, 441(7092):502–505, 2006.
[13] Ernst Fehr and Simon Gächter. Altruistic punishment in humans. Nature, 415(6868):137–140, 2002.
[14] Kevin R McKee, Andrea Tacchetti, Michiel A Bakker, Jan Balaguer, Lucy Campbell-Gillingham, Richard Everett, and Matthew Botvinick. Scaffolding cooperation in human groups with deep reinforcement learning. Nature Human Behaviour, 7(10):1787–1796, 2023.
[15] Weixun Wang, Jianye Hao, Yixi Wang, and Matthew Taylor. Achieving cooperation through deep multiagent reinforcement learning in sequential prisoner’s dilemmas. In Proceedings of the First International Conference on Distributed Artificial Intelligence, pages 1–7, 2019.
[16] Dong-Ki Kim, Matthew Riemer, Miao Liu, Jakob Foerster, Michael Everett, Chuangchuang Sun, Gerald Tesauro, and Jonathan P How. Influencing long-term behavior in multiagent reinforcement learning. Advances in Neural Information Processing Systems, 35:18808–18821, 2022.
[17] Robert Axelrod and William D Hamilton. The evolution of cooperation. Science, 211(4489):1390–1396, 1981.
[18] Zachary Fulker, Patrick Forber, Rory Smead, and Christoph Riedl. Spite is contagious in dynamic networks. Nature Communications, 12(1):260, 2021.
[19] Robert L Axtell and J Doyne Farmer. Agent-based modeling in economics and finance: Past, present, and future. Journal of Economic Literature, pages 1–101, 2022.
[20] Nunzio Lorè and Babak Heydari. Strategic behavior of large language models: Game structure vs. contextual framing. arXiv preprint arXiv:2309.05898, 2023.
[21] Chengyi Xia, Juan Wang, Matjaž Perc, and Zhen Wang. Reputation and reciprocity. Physics of life reviews, 46:8–45, 2023.
[22] Wayne E Baker and Nathaniel Bulkley. Paying it forward vs. rewarding reputation: Mechanisms of generalized reciprocity. Organization science, 25(5):1493–1510, 2014.
[23] Daniel Balliet, Laetitia B Mulder, and Paul AM Van Lange. Reward, punishment, and cooperation: a meta-analysis. Psychological bulletin, 137(4):594, 2011.
[24] Tatsuya Sasaki, Satoshi Uchida, and Xiaojie Chen. Voluntary rewards mediate the evolution of pool punishment for maintaining public goods in large populations. Scientific reports, 5(1):8917, 2015.
[25] Matjaž Perc, Jillian J Jordan, David G Rand, Zhen Wang, Stefano Boccaletti, and Attila Szolnoki. Statistical physics of human cooperation. Physics Reports, 687:1–51, 2017.
[26] Francisco C Santos, JF Rodrigues, and Jorge M Pacheco. Graph topology plays a determinant role in the evolution of cooperation. Proceedings of the Royal Society B: Biological Sciences, 273(1582):51–55, 2006.
[27] Sanjay Jain and Sandeep Krishna. A model for the emergence of cooperation, interdependence, and structure in evolving networks. Proceedings of the National Academy of Sciences, 98(2):543–547, 2001.
[28] Katrin Fehl, Daniel J van der Post, and Dirk Semmann. Co-evolution of behaviour and social network structure promotes human cooperation. Ecology letters, 14(6):546–551, 2011.
[29] Nicolas Anastassacos, Stephen Hailes, and Mirco Musolesi. Partner selection for the emergence of cooperation in multi-agent systems using reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7047–7054, 2020.
[30] Alexander J Stewart and Joshua B Plotkin. Small groups and long memories promote cooperation. Scientific reports, 6(1):26889, 2016.
[31] Alex Bradley, Claire Lawrence, and Eamonn Ferguson. Does observability affect prosociality? Proceedings of the Royal Society B: Biological Sciences, 285(1875):20180116, 2018.
[32] Philip Brookins and Jason Matthew DeBacker. Playing games with gpt: What can we learn about a large language model from canonical strategic games? Available at SSRN 4493398, 2023.
[33] Yiting Chen, Tracy Xiao Liu, You Shan, and Songfa Zhong. The emergence of economic rationality of gpt. arXiv preprint arXiv:2305.12763, 2023.
[34] Steve Phelps and Yvan I Russell. Investigating emergent goal-like behaviour in large language models using experimental economics. arXiv preprint arXiv:2305.07970, 2023.
[35] Elif Akata, Lion Schulz, Julian Coda-Forno, Seong Joon Oh, Matthias Bethge, and Eric Schulz. Playing repeated games with large language models. arXiv preprint arXiv:2305.16867, 2023.
[36] Caoyun Fan, Jindou Chen, Yaohui Jin, and Hao He. Can large language models serve as rational players in game theory? a systematic analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 17960–17967, 2024.
[37] Ayato Kitadai, Yudai Tsurusaki, Yusuke Fukasawa, and Nariaki Nishino. Toward a novel methodology in economic experiments: Simulation of the ultimatum game with large language models. In 2023 IEEE International Conference on Big Data (BigData), pages 3168–3175. IEEE, 2023.
[38] Yadong Zhang, Shaoguang Mao, Tao Ge, Xun Wang, Adrian de Wynter, Yan Xia, Wenshan Wu, Ting Song, Man Lan, and Furu Wei. Llm as a mastermind: A survey of strategic reasoning with large language models. arXiv preprint arXiv:2404.01230, 2024.
[39] Fulin Guo. Gpt agents in game theory experiments. arXiv preprint arXiv:2305.05516, 2023.
[40] Qiaozhu Mei, Yutong Xie, Walter Yuan, and Matthew O Jackson. A turing test of whether ai chatbots are behaviorally similar to humans. Proceedings of the National Academy of Sciences, 121(9):e2313925121, 2024.
[41] Lin Xu, Zhiyuan Hu, Daquan Zhou, Hongyu Ren, Zhen Dong, Kurt Keutzer, See-Kiong Ng, and Jiashi Feng. Magic: Investigation of large language model powered multi-agent in cognition, adaptability, rationality and collaboration. In ICLR 2024 Workshop on Large Language Model (LLM) Agents, 2023.
[42] Jinhao Duan, Renming Zhang, James Diffenderfer, Bhavya Kailkhura, Lichao Sun, Elias Stengel-Eskin, Mohit Bansal, Tianlong Chen, and Kaidi Xu. Gtbench: Uncovering the strategic reasoning limitations of llms via game-theoretic evaluations. arXiv preprint arXiv:2402.12348, 2024.
[43] Gati Aher, Rosa I Arriaga, and Adam Tauman Kalai. Using large language models to simulate multiple humans. arXiv preprint arXiv:2208.10264, 2022.
[44] John J Horton. Large language models as simulated economic agents: What can we learn from homo silicus? Technical report, National Bureau of Economic Research, 2023.
[45] Lisa P Argyle, Ethan C Busby, Nancy Fulda, Joshua R Gubler, Christopher Rytting, and David Wingate. Out of one, many: Using language models to simulate human samples. Political Analysis, 31(3):337–351, 2023.
[46] Benjamin S Manning, Kehang Zhu, and John J Horton. Automated social science: Language models as scientist and subjects. arXiv preprint arXiv:2404.11794, 2024.
[47] Chen Qian, Xin Cong, Cheng Yang, Weize Chen, Yusheng Su, Juyuan Xu, Zhiyuan Liu, and Maosong Sun. Communicative agents for software development. arXiv preprint arXiv:2307.07924, 2023.
[48] Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, et al. Metagpt: Meta programming for multi-agent collaborative framework. arXiv preprint arXiv:2308.00352, 2023.
[49] Junda He, Christoph Treude, and David Lo. Llm-based multi-agent systems for software engineering: Vision and the road ahead. arXiv preprint arXiv:2404.04834, 2024.
[50] Chong Zhang, Xinyi Liu, Mingyu Jin, Zhongmou Zhang, Lingyao Li, Zhengting Wang, Wenyue Hua, Dong Shu, Suiyuan Zhu, Xiaobo Jin, et al. When ai meets finance (stockagent): Large language model-based stock trading in simulated real-world environments. arXiv preprint arXiv:2407.18957, 2024.
[51] Wenyue Hua, Lizhou Fan, Lingyao Li, Kai Mei, Jianchao Ji, Yingqiang Ge, Libby Hemphill, and Yongfeng Zhang. War and peace (waragent): Large language model-based multi-agent simulation of world wars. arXiv preprint arXiv:2311.17227, 2023.
[52] Jen-tse Huang, Eric John Li, Man Ho Lam, Tian Liang, Wenxuan Wang, Youliang Yuan, Wenxiang Jiao, Xing Wang, Zhaopeng Tu, and Michael R Lyu. How far are we on the decision-making of llms? evaluating llms’ gaming ability in multi-agent environments. arXiv preprint arXiv:2403.11807, 2024.
[53] Shaoguang Mao, Yuzhe Cai, Yan Xia, Wenshan Wu, Xun Wang, Fengyi Wang, Tao Ge, and Furu Wei. Alympics: Language agents meet game theory. arXiv preprint arXiv:2311.03220, 2023.
[54] Tomer Ullman. Large language models fail on trivial alterations to theory-of-mind tasks, 2023.
[55] Rose McDermott. Internal and external validity. Cambridge handbook of experimental political science, 27, 2011.
[56] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
[57] David A Gianetto and Babak Heydari. Network modularity is essential for evolution of cooperation under uncertainty. Scientific reports, 5(1):1–7, 2015.