## 1 Introduction

Game theory has traditionally centered on finding players’ strategies in equilibrium. In recent years, there has been growing interest in the inverse setting of learning game parameters from observed player actions [Vorobeychik, Wellman, and Singh2007, Blum, Haghtalab, and Procaccia2014, Waugh, Ziebart, and Bagnell2011]. Recent work by Ling, Fang, and Kolter [Ling, Fang, and Kolter2018] tackles this problem in the zero-sum setting by providing an end-to-end framework that learns game parameters, such as payoff matrices and chance node probability distributions, assuming the actions are sampled from the quantal response equilibrium (QRE). At the core of the framework is a differentiable game solving module.

However, their proposed method suffers from two major flaws. First, the assumption that players behave in accordance with the QRE severely limits the space of player strategies, and the QRE is known to exhibit pathological behavior even in one-player settings. Second, their solvers are computationally inefficient and do not scale.

Our work addresses these deficiencies in two ways. First, we propose the Nested Logit Quantal Response equilibrium (NLQRE), which draws upon ideas from behavioral science and allows for varying levels of player rationality at each stage of the game. We show that the NLQRE is strictly more general than the models considered by Ling, Fang, and Kolter [Ling, Fang, and Kolter2018], and cannot be replicated by a straightforward scaling of payoff matrices. We derive the required gradients and show that player rationality can be learned via gradient descent using the same end-to-end learning framework. Second, we substantially reduce training time by reformulating the backward pass as a min-max convex optimization problem and using state-of-the-art first-order primal-dual methods for both the forward and backward passes. Unlike previous work, which relied on second-order methods, our first-order solver does not require explicit formation of Hessians and only requires access to a fast best-response oracle. In our evaluation with random payoff matrices and one-card poker, we report speedups of orders of magnitude. Lastly, we evaluate the NLQRE on real-world data in a one-player information gathering game and provide qualitative insights. In total, we believe that our work is a significant step towards the practical learning of human behavior in zero-sum settings.

## 2 Background and related work

Although much less well studied than traditional equilibrium finding, several approaches address the task of learning games in the setting where the underlying game payoffs are unknown. These include methods which rely on specific game structure such as symmetry [Vorobeychik, Wellman, and Singh2007], operate in an active setting [Blum, Haghtalab, and Procaccia2014], or focus primarily on normal form games and straightforward linear settings [Waugh, Ziebart, and Bagnell2011]. Ling, Fang, and Kolter [Ling, Fang, and Kolter2018] provide an alternative framework which embeds a differentiable game solver within another gradient-based learner (e.g., a deep network), as illustrated in Figure 1. This enables game parameters to be learned via simple gradient descent. We now describe their framework briefly.

Suppose $P_\Phi(x)$ is the zero-sum payoff matrix given some features $x$ and game parameters $\Phi$ which we wish to learn. The game solver takes in $P_\Phi(x)$ and outputs the QRE $(u, v)$, which corresponds to the mixed strategies of the min and max player. During training, the log loss $L(a, u, v)$ of the solver’s predicted strategies is computed against observed actions $a$. The game parameters are then optimized by minimizing $L$. This is performed by propagating gradients backwards through the game solver and performing gradient descent, where the required gradients for the backward pass are readily derived using the implicit function theorem. The training phase is summarized in Algorithm 1. We will now touch on two key ideas from decision and game theory, which will eventually culminate in the proposed NLQRE.

### 2.1 Nested-logit choice models

One of the fundamental research problems in behavioral science is to mathematically model seemingly irrational (or non-utility maximizing) human behavior. Among the most important models is the class of random utility models (RUM) [Thurstone1927]. The logit model (more commonly known in the machine learning community as the ‘softmax’ operator) is the most notable RUM: given a set of alternatives $A$, each with (known) utility $U_a$, the probability that alternative $a$ is picked is $\exp(U_a) / \sum_{a' \in A} \exp(U_{a'})$. It is equivalent to the probability that alternative $a$ has the highest utility under Gumbel noise, i.e., $\Pr\left(a = \arg\max_{a' \in A} (U_{a'} + \epsilon_{a'})\right)$, where the $\epsilon_{a'}$ are i.i.d. Gumbel distributed.

However, the logit model suffers from limitations. These include the classic ‘red and blue bus’ pathologies [Luce2012], which restrict the class of behaviors permitted. Suppose there are 3 alternatives for transport: a red bus, a blue bus, and a car. The player derives the same utility for each alternative, $U_{\text{red}} = U_{\text{blue}} = U_{\text{car}}$. Applying the logit model gives an equal probability of $1/3$ of choosing each vehicle. One would, however, expect the car to be taken with probability $1/2$ and each bus to be chosen with probability $1/4$, since the color of buses should have no impact on decisions. Specifically, logit models obey the property of independence of irrelevant alternatives, which does not take into account cases where alternatives are ‘qualitatively’ similar.

Nested-logit (NL) models [Train2009] address this limitation by grouping fundamentally similar alternatives together and allowing for correlations between the $\epsilon_a$’s belonging to the same group. In a two-level NL model, $A$ is divided into $K$ disjoint clusters $B_1, \dots, B_K$, with alternative $a$ belonging to cluster $B_k$ chosen with probability

$$\Pr(a) = \frac{\exp(U_a/\lambda_k)\left(\sum_{a' \in B_k} \exp(U_{a'}/\lambda_k)\right)^{\lambda_k - 1}}{\sum_{l=1}^{K} \left(\sum_{a' \in B_l} \exp(U_{a'}/\lambda_l)\right)^{\lambda_l}},$$

where the $\lambda_k$’s are parameters governing noise correlation. These probabilities may be interpreted as a two-stage decision making process: in the first stage, a cluster is chosen, and in the second stage, the specific action is selected based on a (scaled) softmax within the cluster. The probability of choosing each cluster in the first stage is given by the softmax over the (scaled) log-sum-exp of each cluster. When $\lambda_k = 1$ for all $k$, the standard logit model is recovered, and when $\lambda_k \to 0$, ‘elimination by aspects’ is obtained [Tversky1972]. NL models can have multiple layers, leading to an NL tree representing the nested grouping. The reader is directed to the book by Train [Train2009] for background on nested logits and their various interpretations.
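The two-stage interpretation can be made concrete with a short numerical sketch. The code below is an illustration only, assuming the standard two-level nested-logit form from Train [Train2009]; the function name `nested_logit_probs` and the choice of zero utilities are ours. It reproduces the bus example: with all nest parameters equal to 1 each vehicle is chosen with probability 1/3, while a near-zero parameter for the bus cluster gives roughly 1/4, 1/4, 1/2.

```python
import numpy as np

def nested_logit_probs(utilities, clusters, lambdas):
    """Two-level nested-logit choice probabilities.

    utilities: 1-D array of utilities U_a, one per alternative.
    clusters:  list of index lists forming a partition of the alternatives.
    lambdas:   one positive correlation parameter per cluster.
    """
    utilities = np.asarray(utilities, dtype=float)
    probs = np.zeros_like(utilities)
    # Stage 1 score of each cluster: lambda_k * logsumexp(U / lambda_k).
    scores = []
    for idx, lam in zip(clusters, lambdas):
        scaled = utilities[idx] / lam
        m = scaled.max()
        scores.append(lam * (m + np.log(np.exp(scaled - m).sum())))
    scores = np.array(scores)
    cluster_probs = np.exp(scores - scores.max())
    cluster_probs /= cluster_probs.sum()
    # Stage 2: scaled softmax within each cluster, weighted by the cluster probability.
    for idx, lam, p_k in zip(clusters, lambdas, cluster_probs):
        scaled = utilities[idx] / lam
        within = np.exp(scaled - scaled.max())
        probs[idx] = p_k * within / within.sum()
    return probs

# Alternatives: 0 = red bus, 1 = blue bus, 2 = car, all with equal utility.
U = [0.0, 0.0, 0.0]
clusters = [[0, 1], [2]]
logit = nested_logit_probs(U, clusters, [1.0, 1.0])    # plain logit: 1/3 each
nested = nested_logit_probs(U, clusters, [1e-3, 1.0])  # buses treated as nearly identical
```

The per-cluster score is the scaled log-sum-exp mentioned above, so the product of the two stages matches the closed-form nested-logit probability.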

### 2.2 Quantal response equilibria (QRE)

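We now turn our attention to 2-player games. Seminal work by McKelvey and Palfrey [McKelvey and Palfrey1995] proposes the QRE as a noisy alternative to the Nash equilibrium (NE). Similar to logit choice models, the QRE is the equilibrium obtained when payoffs are perturbed by noise obeying a Gumbel distribution. Formally, $(u, v)$ is a QRE of a normal form game with action sets $A_u$ and $A_v$ for the two players and payoff matrix $P$ if

$$u_i = \frac{\exp(-(Pv)_i/\lambda)}{\sum_{i' \in A_u} \exp(-(Pv)_{i'}/\lambda)}, \qquad v_j = \frac{\exp((P^T u)_j/\lambda)}{\sum_{j' \in A_v} \exp((P^T u)_{j'}/\lambda)},$$

where $\lambda$ is a parameter governing the level of agent rationality. Observe that as $\lambda \to \infty$, players behave uniformly at random, while the QRE approaches a NE as $\lambda \to 0$. For zero-sum games, it is further known [Mertikopoulos and Sandholm2016] that the QRE is the unique solution of the following convex-concave program

$$\min_u \max_v \; u^T P v + \lambda \sum_{i \in A_u} u_i \log u_i - \lambda \sum_{j \in A_v} v_j \log v_j \quad \text{subject to } \mathbf{1}^T u = 1,\; \mathbf{1}^T v = 1.$$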
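To see why logit responses arise from such a program, it helps to write out the stationarity conditions. The derivation below is a sketch that assumes the entropy-regularized form $\min_u \max_v\, u^T P v + \lambda \sum_i u_i \log u_i - \lambda \sum_j v_j \log v_j$ over the probability simplices; the multipliers $\mu$ and $\nu$ are introduced here purely for illustration.

```latex
\begin{aligned}
\mathcal{L}(u, v, \mu, \nu) &= u^T P v + \lambda \sum_i u_i \log u_i - \lambda \sum_j v_j \log v_j
  + \mu (\mathbf{1}^T u - 1) - \nu (\mathbf{1}^T v - 1), \\
\frac{\partial \mathcal{L}}{\partial u_i} &= (Pv)_i + \lambda (\log u_i + 1) + \mu = 0
  \;\Longrightarrow\; u_i \propto \exp\!\left(-(Pv)_i / \lambda\right), \\
\frac{\partial \mathcal{L}}{\partial v_j} &= (P^T u)_j - \lambda (\log v_j + 1) - \nu = 0
  \;\Longrightarrow\; v_j \propto \exp\!\left((P^T u)_j / \lambda\right).
\end{aligned}
```

Normalizing each expression gives a softmax response for each player. As $\lambda$ grows the exponents vanish and both strategies become uniform, while as $\lambda$ shrinks the responses concentrate on best responses.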

For a two-player extensive form game characterized by a game tree with information sets $\mathcal{I}_u$ and $\mathcal{I}_v$ for the min and max player respectively, Ling, Fang, and Kolter [Ling, Fang, and Kolter2018] show that when $\lambda = 1$, the QRE in reduced normal form of the game is equivalent to the solution of the following regularized min-max problem, where $u$ and $v$ are the players’ strategies in sequence form [Von Stengel1996].

$$\min_u \max_v \; u^T P v + \sum_{i \in \mathcal{I}_u} \sum_{a \in \mathcal{A}_i} u_a \log \frac{u_a}{u_{p_i}} - \sum_{i \in \mathcal{I}_v} \sum_{a \in \mathcal{A}_i} v_a \log \frac{v_a}{v_{p_i}} \quad \text{subject to } Eu = e,\; Fv = f \tag{1}$$

In the above, $P$ is the sequence form payoff matrix and $E$ and $F$ are the sequence form constraint matrices. $\mathcal{A}_i$ denotes the possible actions at information set $i$, while $p_i$ is the action (from the same player) preceding $i$. In the sequence form, one works with realization plans $u$ and $v$ as opposed to probability vectors. These realization plans represent probabilities of choosing a given sequence, while the constraint matrices $E$ and $F$ have entries in $\{-1, 0, 1\}$ and encode parent-child relationships in the game tree. The sequence form is significantly more compact than the normal form while retaining virtually all of its strategic elements.

## 3 Nested-logit quantal response equilibria

Our proposed Nested-logit QRE (NLQRE) is a generalization of both the QRE (in zero-sum games) and NL models. That is, it generalizes NL models to two player zero-sum games, or equivalently, extends the QRE by permitting a more general nested logit structure. This allows us to model a far wider range of player behaviors, and in particular, cases where player rationality varies between stages of the game. We assume that the grouping of actions within each information set is known a-priori. The NLQRE is given by the unique solution to the following optimization problem

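$$\min_u \max_v \; u^T P v + \sum_{i \in \mathcal{I}_u} \lambda_i \sum_{a \in \mathcal{A}_i} u_a \log \frac{u_a}{u_{p_i}} - \sum_{i \in \mathcal{I}_v} \lambda_i \sum_{a \in \mathcal{A}_i} v_a \log \frac{v_a}{v_{p_i}} \quad \text{subject to } Eu = e,\; Fv = f \tag{2}$$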

The NL model is recovered in a one-player setting (i.e., is a constant vector) and the QRE is recovered when there is no nesting and all $\lambda_i$’s are equal. Ling, Fang, and Kolter [Ling, Fang, and Kolter2018] assume that all the $\lambda_i$’s are equal to $1$ and focus on learning entries of the payoff matrix by exploiting the smoothness of QRE solutions, which allows gradient-based approaches [Amin, Singh, and Wellman2016] to be employed for learning. In this paper, we do not assume that the $\lambda_i$’s are known in our solution concept and instead treat them as parameters to be learned.
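As a minimal illustration of treating rationality as a learnable parameter, consider the simplest possible case: a single cost-minimizing player with one decision and a single rationality parameter. The sketch below is not the paper's training procedure (no game solver or implicit differentiation is involved); the names `logit_strategy` and `fit_rationality`, the cost vector, and the use of noise-free frequencies in place of sampled actions are all assumptions made for the example. It runs gradient descent on the expected log loss with respect to $\beta = 1/\lambda$, where the gradient is the empirical mean cost minus the model's mean cost.

```python
import numpy as np

def logit_strategy(costs, lam):
    """Logit (softmax) response of a cost-minimizing player with rationality lam."""
    z = -np.asarray(costs, dtype=float) / lam
    z -= z.max()
    p = np.exp(z)
    return p / p.sum()

def fit_rationality(costs, freqs, beta0=1.0, step=0.5, iters=500):
    """Gradient descent on the expected log loss with respect to beta = 1 / lambda.

    freqs are empirical action frequencies. The log loss is convex in beta and
    its gradient is (empirical mean cost) - (model mean cost).
    """
    costs = np.asarray(costs, dtype=float)
    beta = beta0
    for _ in range(iters):
        model = logit_strategy(costs, 1.0 / beta)
        grad = freqs @ costs - model @ costs
        beta -= step * grad
    return 1.0 / beta

costs = np.array([0.0, 1.0, 2.0])
true_lam = 0.7
freqs = logit_strategy(costs, true_lam)  # idealized, noise-free "observations"
learned_lam = fit_rationality(costs, freqs)
```

With exact frequencies the learned value matches the generating parameter; with sampled actions it would only match up to sampling noise.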

The additional representation power brought by introducing $\lambda$’s to (1) cannot be achieved by a simple scaling of the payoff matrix in the original formulation by Ling, Fang, and Kolter [Ling, Fang, and Kolter2018], even in the non-nested, simultaneous-move normal form setting. To see why this is so, consider the game of symmetric rock-paper-scissors with non-uniform rewards (i.e., the payoffs for winners depend on their specific action). Suppose the game is played between ‘strong’ and ‘weak’ players, and this is reflected by low and high $\lambda$ parameters respectively. Due to the differing $\lambda$’s for each player, the strategies of the two players in equilibrium are different. However, scaling $P$, or even changing individual payoffs for winners (while maintaining symmetry), can only result in a symmetric equilibrium.
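The rock-paper-scissors argument can be checked numerically. The sketch below assumes the normal-form logit response equations with a separate rationality parameter per player, and uses a plain alternating iteration rather than either solver discussed in this paper; the payoff values and the function name `normal_form_qre` are illustrative choices. The alternating map is a contraction whenever $\|P\|_2^2 < 4 \lambda_u \lambda_v$, which holds for both parameter settings used here.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def normal_form_qre(P, lam_u, lam_v, iters=2000):
    """Alternate logit responses for the min player u and the max player v.

    This simple iteration is a contraction when ||P||_2^2 < 4 * lam_u * lam_v,
    which holds for the parameters used below; it is not the paper's solver.
    """
    u = np.full(P.shape[0], 1.0 / P.shape[0])
    v = np.full(P.shape[1], 1.0 / P.shape[1])
    for _ in range(iters):
        u = softmax(-P @ v / lam_u)
        v = softmax(P.T @ u / lam_v)
    return u, v

# Symmetric rock-paper-scissors with non-uniform stakes (P = -P^T).
# Rows and columns are ordered rock, paper, scissors; P[i, j] is the loss
# of the row (min) player.
P = np.array([[0.0, 1.0, -2.0],
              [-1.0, 0.0, 3.0],
              [2.0, -3.0, 0.0]])

u_eq, v_eq = normal_form_qre(P, lam_u=2.0, lam_v=2.0)    # equal rationality
u_neq, v_neq = normal_form_qre(P, lam_u=1.0, lam_v=5.0)  # 'strong' vs 'weak'
```

With equal parameters the two strategies coincide, as symmetry of the game requires. With unequal parameters the two players settle on visibly different mixtures, which no symmetric rescaling of `P` under a shared parameter can reproduce.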

Remark 1. Readers familiar with nested logits may recall that the most common form of nested logits does not admit chance nodes (or in our 2-player setting, parallel information sets). It may be shown that there is a natural way of admitting them by representing each alternative as a pure strategy in the reduced normal form, and by nesting each action based on the information sets which have a non-zero probability of being reached. The details are presented in the appendix.

Remark 2. The expression in (2) is fairly general. Broadly speaking, our framework allows for 2 types of nesting. First, we allow for nesting via information sets (i.e., each information set gets its own $\lambda$, see Remark 1), and second, for nesting by clustering actions within an information set, which is achieved by introducing intermediate information sets (e.g., the ‘red and blue bus’ example). Our experiments in Section 5 focus on the former. However, (2) and our proposed solver are able to handle the latter case, assuming that the nesting structure is known a-priori.

Remark 3. The form of regularization in (2) is known as dilated entropy regularization, and the $\lambda$’s may be interpreted as governing the degree of regularization or smoothing. Its form was first introduced by Hoda et al. [Hoda et al.2010], and follow-up work by Kroer et al. [Kroer et al.2017] provided specific strong-convexity bounds on the regularizer. In particular, a specific instantiation of the $\lambda$’s results in a best-response problem which is 1-strongly convex. The authors exploit this fact to yield some of the fastest solvers for Nash equilibria of two-player zero-sum extensive form games. Note, however, that their motivation is primarily computational in nature, whereas ours is modelling.

### 3.1 NLQRE solver

Following the ideas proposed by Ling, Fang, and Kolter [Ling, Fang, and Kolter2018], we present a naive solver for the NLQRE based on Newton’s method. Denote and as sets of possible information sets immediately following or . Define
, and let be the information set immediately preceding the action , i.e. where . The KKT conditions for (2) are, for all , and for all ,

(3) |

These are necessary and sufficient conditions for NLQRE, implying that the NLQRE can be found by applying Newton’s method to (3), yielding the following updates

(4) | |||

(5) |

contains terms in (3) and is defined analogously in terms of the appropriate and ’s. Observe that and are diagonally dominant and symmetric, implying that they are positive definite. In the backward pass, we require the gradients of the loss with respect to and . Similar to prior work [Gould et al.2016, Amos and Kolter2017, Ling, Fang, and Kolter2018], this may be done by applying the implicit function theorem or by simply manipulating differentials. This yields the gradients ; for , for , where

(6) | |||

(7) | |||

(8) |

## 4 Fast forward and backward pass solvers

In the framework of Ling, Fang, and Kolter [Ling, Fang, and Kolter2018], each gradient step in Algorithm 1 involves solving an optimization problem. Thus, having efficient solvers is crucial in scaling up. In the naive solver, the forward pass is solved using Newton’s method and we need to solve the system of linear equations (5) in each iteration. When the game tree is large, solving this system multiple times dramatically slows down training. Similarly, for the backward pass, one needs to solve the single linear system shown in (8). When the game is large, naively solving the linear system is also prohibitively slow, even when utilizing sparse matrices. This serves as motivation for first-order iterative methods (FOMs), which do not require the solution of a linear system as a subroutine. FOMs are also computationally attractive for solving extensive form games because the underlying tree structure of games may be exploited. We will focus on optimization problems in the following min-max form.

(9) |

where and are strictly convex functions. It is obvious from (2) that the forward pass in our problem solves a problem in this form. We will show later that the backward pass problem shown in (8) can also be seen as solving a problem in this format.

Many methods to solve (9) have been proposed. In this paper, we adapt the method proposed by Chambolle and Pock [Chambolle and Pock2016], whose algorithm is more general and applies beyond game solving. This method, like many other first-order methods, applies best-response subroutines to smoothed versions of the original min or max problem taken in isolation. The solution is obtained by alternating between best responses for minimization and maximization. Algorithm 2 gives a high-level overview of the optimization procedure, where BR are smoothed best responses with appropriately chosen Bregman divergences , their associated convex functions , and ‘step sizes’ .

We first set for convenience. For Algorithm 2 to be practical, we require the best response oracles to be computed efficiently. Setting and to be of a specific form similar to and respectively simplifies and to be (up to a factor) of the forms

which are efficiently computed by exploiting the structure of the problem in extensive form games. This avoids the need to solve a linear system with as part of the design matrix. The remainder of this section outlines the procedures required for both forward and backward passes. For brevity, we discuss this from the view of the minimization; the maximization subroutine is entirely analogous. Lastly, we remark that the computational advances in this section are independent of the NLQRE, i.e., they remain applicable to the original framework of Ling, Fang, and Kolter [Ling, Fang, and Kolter2018].
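To make the alternating smoothed best-response idea concrete, the following sketch applies a primal-dual iteration in the spirit of Chambolle and Pock [Chambolle and Pock2016] to the simplest case, an entropy-regularized normal-form game. It is a simplified illustration and not a reproduction of Algorithm 2: the Kullback-Leibler divergence as the Bregman divergence, the step sizes `tau` and `sigma`, the fixed iteration count, and the payoff matrix are all assumptions made here. Each smoothed best response has a closed-form softmax solution, so no linear system is solved.

```python
import numpy as np

def prox_entropy(cost, prev, lam, step):
    """Smoothed best response on the simplex.

    Minimizes <cost, x> + lam * sum(x log x) + (1 / step) * KL(x || prev),
    whose closed form is a softmax mixing -cost with log(prev).
    """
    z = (-cost + np.log(prev) / step) / (lam + 1.0 / step)
    z -= z.max()
    x = np.exp(z)
    return x / x.sum()

def primal_dual_qre(P, lam=1.0, tau=0.5, sigma=0.5, iters=3000):
    """Alternate smoothed best responses with an extrapolation step."""
    n, m = P.shape
    u = np.full(n, 1.0 / n)
    v = np.full(m, 1.0 / m)
    u_bar = u.copy()
    for _ in range(iters):
        # Max player responds to the extrapolated min-player strategy.
        v = prox_entropy(-P.T @ u_bar, v, lam, sigma)
        # Min player responds to the updated max-player strategy.
        u_new = prox_entropy(P @ v, u, lam, tau)
        # Extrapolate the min player's strategy for the next max-player update.
        u_bar = 2.0 * u_new - u
        u = u_new
    return u, v

# Illustrative payoff matrix with entries in [-1, 1].
P = np.array([[0.5, -1.0, 0.2],
              [-0.3, 0.8, -0.6],
              [1.0, -0.4, 0.1]])
u, v = primal_dual_qre(P)
```

A fixed point of this iteration satisfies the logit response equations, so the returned pair can be checked by comparing `u` against the softmax of `-P @ v / lam` and `v` against the softmax of `P.T @ u / lam`. In the extensive-form setting the closed-form softmax is replaced by a tree traversal.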

#### Forward Pass

For this section, the when referring Algorithm 2. Setting to be the entropy terms in (2) and gives the expression in the form of (9). The natural divergence to be chosen is the standard entropy divergence adapted to the dilated setting (dropping terms in which do not contain ).

where a similar expression holds for . Plugging into the expression for gives

It is known that may be solved by a single bottom-up traversal of the game tree and a single sparse matrix-vector multiplication [Hoda et al.2010]. At each information set , we solve for the ‘behavioral’ best response (i.e., assuming that the information set was the root). Each of these sub-problems may be expressed in closed form using log-sum-exp and softmax functions. The sequence form is recovered from behavioral strategies with a single downward traversal of the tree. The precise details are contained in the appendix.
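The bottom-up and top-down traversals can be sketched on a toy one-player tree. The example below assumes a dilated entropy regularizer of the form $\sum_i \lambda_i \sum_{a \in \mathcal{A}_i} u_a \log (u_a / u_{p_i})$ and a linear cost on sequences; the tree layout, the `INFOSETS` structure, and the function name are hypothetical and chosen only for illustration. At each information set the behavioral strategy is a softmax and the value passed to the parent sequence is a scaled log-sum-exp.

```python
import numpy as np

# Hypothetical one-player tree: sequence 0 is the empty sequence.
# Information set 0 (root) offers sequences 1 and 2; taking sequence 1
# leads to information set 1, which offers sequences 3 and 4.
INFOSETS = [
    {"parent_seq": 0, "actions": [1, 2]},
    {"parent_seq": 1, "actions": [3, 4]},
]

def smoothed_best_response(cost, lambdas):
    """Minimize <cost, u> + sum_i lambdas[i] * (dilated entropy at infoset i)."""
    n_seq = len(cost)
    # value[s]: cost of sequence s plus the soft-min values of infosets below it.
    value = np.array(cost, dtype=float)
    behavior = {}
    # Bottom-up pass: children are listed after parents, so iterate in reverse.
    for i in reversed(range(len(INFOSETS))):
        info, lam = INFOSETS[i], lambdas[i]
        z = -value[info["actions"]] / lam
        m = z.max()
        log_sum = m + np.log(np.exp(z - m).sum())
        behavior[i] = np.exp(z - log_sum)            # softmax within the infoset
        value[info["parent_seq"]] += -lam * log_sum  # soft-min of the action values
    # Top-down pass: realization plan = product of behavioral probabilities.
    u = np.zeros(n_seq)
    u[0] = 1.0
    for i, info in enumerate(INFOSETS):
        u[info["actions"]] = u[info["parent_seq"]] * behavior[i]
    return u

u_uniform = smoothed_best_response(np.zeros(5), lambdas=[1.0, 1.0])
u_nested = smoothed_best_response(np.zeros(5), lambdas=[1.0, 1e-3])
```

With zero costs and all parameters equal to 1, the three pure strategies of the reduced normal form (sequence 2, and sequence 1 followed by 3 or 4) each receive probability 1/3. Shrinking the parameter of the lower information set towards zero splits the root evenly instead, mirroring the bus example.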

#### Backward Pass

The backward pass also requires solving a linear system to obtain . We begin by making the crucial observation that the (necessary and sufficient) KKT conditions of the following optimization problem are precisely the linear system in (8).

subject to | (10) |

Note that and are constants in the backward pass; here we are optimizing over , which are not probabilities. Since and are positive definite, this is a convex-concave problem of the form required by Algorithm 2. We select the natural distance generating function which yields (ignoring terms containing only ),

Plugging this into the expression for and rearranging gives

(11) |

Letting in (16) gives the KKT conditions

where are Lagrange multipliers. Multiplying by gives a linear system in

(12) |

After solving for , one may solve for

(13) |

###### Proposition 1.

The derivation involves exploiting the tree-structure inherent in extensive form games. Computational details and proofs are deferred to the appendix.

## 5 Experiments

The proposed first order method was implemented using Cython, since the best-response subroutines require tree traversals, which are expensive in Python, while the second order method used the NumPy and SciPy libraries for the solution of linear systems. Where possible, we utilized the SciPy sparse matrix library. This was seen to provide a significant speedup for sparse $P$ for both our method and Newton’s method. The PyTorch automatic differentiation library [Paszke et al.2017] was used to automatically obtain gradients for components outside the game solving module.

### 5.1 Synthetic datasets

Here we use randomly generated extensive form games to illustrate the computational efficiency of our proposed first order method compared to the second order method used by Ling, Fang, and Kolter [Ling, Fang, and Kolter2018]. We evaluate the solvers for the forward and backward passes in isolation. The experiments are run over several depths . Normal form games have . When , we adopt the following extensive form game. players play distinct simultaneous sub-games in succession, where each simultaneous sub-game has actions. Transitions to the next sub-game are governed by the joint action of both players, i.e., the size of will be exponential in . The payoff matrices were generated with each non-zero entry uniformly chosen from , and the rationality parameter for each information set uniformly and independently chosen between . All timings presented are wall-clock timings. Experiments are run on the cloud with identical Amazon EC2 instances. We set for all evaluations.

#### Evaluation of forward passes

In the forward pass, we compared the baseline Newton solver to our proposed first-order method. However, the termination criteria for the 2 methods are not identical, as Newton’s method minimizes the residual rather than the duality gap. To strike a fair comparison, we evaluated the 2 methods by first running Newton’s method until a residual of less than is achieved. The duality gap of that solution is computed and subsequently used as the termination criterion for the FOM. (On occasion, the Newton solver gave a gap extremely close to numerical precision. In these cases, we apply a termination criterion of .) The timings and speedup are averaged over trials and presented in Figure 2.

Error bars represent 1 standard deviation. Dotted lines are optimal results given the ground truth.

#### Evaluation of backward passes

In the backward pass, the comparison for our proposed FOM is against solving the linear system in (8) directly. In the loss function, we concern ourselves with the setting where the true matrix and parameters are used in computing for the forward pass. This corresponds to the case where the model is already fairly well trained. The results over 50 trials are presented in Figure 3.

It is clear from both figures that our method scales much better than Newton’s method for randomly generated matrices. Speedups of more than an order of magnitude are fairly common, and the improvement increases with problem size. Furthermore, our method consumed far less memory than sparse solvers. In fact, solving the sparse system when (not plotted) required more than GB of memory. On the other hand, our FOM was able to solve such instances in less than a minute, with no noticeable increase in memory usage. Note that contains more than 1.4 million rows and columns in this setting.

### 5.2 One-Card Poker

Here we evaluate our method on the game of one-card poker. This multi-card extension of Kuhn poker contains most interesting strategic elements of game playing (e.g., bluffing) and was used by Ling, Fang, and Kolter [Ling, Fang, and Kolter2018] to illustrate that it is possible to learn the distribution of cards in a deck just by observing player actions. However, the authors worked only on tiny settings with just cards; in sequence form, player strategies may be represented in a dimension vector. Furthermore, the authors assumed that there were no varying input features (i.e., the card distributions were identical for each action observed). These assumptions enabled them to achieve a significant speedup by solving the game just once in the forward pass, rather than once for each point in the mini-batch. As we will see, their solver is too slow to be of practical use in larger or featurized settings.

Here, we operate in a slightly different setting. Instead of trying to learn underlying card distributions, we learn player rationality parameters. We assume that player rationality is independent of the cards being drawn, and only depends on the past actions of (both) players. In this setting, there are just parameters to be learned. This is independent of the size of the deck.

We generate our data assuming that player rationality is some linear function of a scalar feature, i.e., there are weights to be learned. The weight vector is drawn uniformly from . Feature vectors are drawn between . Our model is . The addition of a small ensures that the $\lambda$’s will always remain positive; in our experiments, . For each feature, we compute the $\lambda$’s and find the corresponding equilibrium, from which we sample player actions. The training set is of size , with an independent test set of size . We minimized the log loss using the Adam optimizer with a batch size of and learning rate of .

Left: log-loss and mse per epoch, Right: time required to obtain a particular log loss or mse. Dotted lines are optimal results given the ground truth.

We compared our solver against Newton’s method, which terminates at a residual of . We fixed for the forward solver and for the backward solver. The results are plotted in Figure 4.

In all cases, the log-loss is close to optimal after around epochs. As expected, our exact solver exhibits behavior almost identical to that of Newton’s method on a per-epoch basis, while being significantly faster than the baseline. It was observed that at almost all stages of training, Newton’s method took almost orders of magnitude more time to learn a model of similar performance. In fact, a single epoch using Newton’s method takes as much time as training the entire model using our solver.

### 5.3 Information gathering dataset

Here we demonstrate the applicability of the nested logit model (i.e., a one-player game) using a publicly available dataset [Hunt et al.2016]. The game proceeds as follows. Suppose there are 4 face-down cards ranging from 1-10 placed in a matrix (with potential repetitions). The goal of the game is to select the row containing cards with the largest sum. The game proceeds in 4 stages. At each stage of the game, the player may make a guess prematurely, or spend some points to reveal a new card. At the fourth and final stage, the player has to make a guess. The player obtains a reward of 60 and -50 points for correct and incorrect guesses respectively, and may only guess once. The challenge is for the player to judge if it is worth paying to gather more information. Computationally, the optimal policy may be easily obtained using dynamic programming.

However, humans are rarely perfectly rational. We model bounded rationality using the nested logit model. It is assumed that the level of rationality should be a function of a) how many cards are already open, and b) side information such as one’s educational qualifications. This leads to a natural description of the game with different ’s, each of which is some function of features, which we describe below.

Two models are trained for this experiment. NoFeat refers to the case when there are no features (i.e., we are simply learning ) and Feat to the case when we are exploiting demographic information. In the latter case, features comprise the player’s academic qualifications and age, both with one-hot encodings. A player’s age is split into 8 age ranges, and education levels follow those of the UK (i.e., GCSE, A levels, Undergraduate, Graduate). Our model employs a neural network with 3 hidden layers of width 100, interspersed with rectified linear activations. To ensure the $\lambda$’s are positive, all inputs were exponentiated before being fed into the solver. Figure 5 shows the log loss over the overall game as well as the loss at each individual stage. For comparison, we also provide the results for a player who picks a random action at every stage of the game. The learned parameters for each configuration of features are presented in Figure 6.

| | loss | loss(1) | loss(2) | loss(3) | loss(4) |
|---|---|---|---|---|---|
| Uniform | 1.833 | 1.099 | 1.099 | 1.099 | 0.693 |
| NoFeat | 1.422 | 0.878 | 0.863 | 0.826 | 0.130 |
| Feat | 1.419 | 0.874 | 0.866 | 0.818 | 0.145 |

From Figure 5, we can see that both trained models significantly outperform Uniform. Log losses at each stage appear to decrease with the stage number. This is unsurprising since players behave more rationally (and hence predictably) as more information is revealed. However, our model appears to perform worse at stage $4$, which is in fact a problem with full observability. We suspect this higher loss is a consequence of our model ‘overfitting’ to be overly confident at the final stage, incurring a huge loss on the rare occasion a player answers incorrectly.

Several trends are observed from Figure 6. First, notice that decrease by approximately a factor of between stages. This is fairly expected, since each information gathering leads to other potential states. Also unsurprisingly, better educated respondents exhibit more rationality (recall that a lower $\lambda$ implies a more rational player). Interestingly, we can see a U-shaped trend in all stages, suggesting that people in their mid-twenties and thirties are most rational. Both these observations agree with the findings of Hunt et al. [Hunt et al.2016], where it was shown that higher educated and middle-aged respondents obtained the most reward.

## 6 Conclusion

In this paper, we substantially improve upon existing work in differentiable game learning. We propose the NLQRE, which generalizes QRE and NL models. We also derive gradients for backpropagation and learning, and develop solvers which lead to speedups of several orders of magnitude. Future work includes the learning of game structure and extensions to general-sum games.

## References

- [Amin, Singh, and Wellman2016] Amin, K.; Singh, S.; and Wellman, M. P. 2016. Gradient methods for stackelberg security games. In Conference on Uncertainty in Artificial Intelligence, 2–11.
- [Amos and Kolter2017] Amos, B., and Kolter, J. Z. 2017. Optnet: Differentiable optimization as a layer in neural networks. arXiv preprint arXiv:1703.00443.
- [Blum, Haghtalab, and Procaccia2014] Blum, A.; Haghtalab, N.; and Procaccia, A. D. 2014. Learning optimal commitment to overcome insecurity. In Advances in Neural Information Processing Systems, 1826–1834.
- [Chambolle and Pock2016] Chambolle, A., and Pock, T. 2016. On the ergodic convergence rates of a first-order primal–dual algorithm. Mathematical Programming 159(1-2):253–287.
- [Gould et al.2016] Gould, S.; Fernando, B.; Cherian, A.; Anderson, P.; Cruz, R. S.; and Guo, E. 2016. On differentiating parameterized argmin and argmax problems with application to bi-level optimization. arXiv preprint arXiv:1607.05447.
- [Hoda et al.2010] Hoda, S.; Gilpin, A.; Pena, J.; and Sandholm, T. 2010. Smoothing techniques for computing nash equilibria of sequential games. Mathematics of Operations Research 35(2):494–512.
- [Hunt et al.2016] Hunt, L. T.; Rutledge, R. B.; Malalasekera, W. N.; Kennerley, S. W.; and Dolan, R. J. 2016. Approach-induced biases in human information sampling. PLoS biology 14(11):e2000638.
- [Kroer et al.2017] Kroer, C.; Waugh, K.; Kilinc-Karzan, F.; and Sandholm, T. 2017. Theoretical and practical advances on smoothing for extensive-form games. arXiv preprint arXiv:1702.04849.
- [Ling, Fang, and Kolter2018] Ling, C. K.; Fang, F.; and Kolter, J. Z. 2018. What game are we playing? end-to-end learning in normal and extensive form games. arXiv preprint arXiv:1805.02777.
- [Luce2012] Luce, R. D. 2012. Individual choice behavior: A theoretical analysis. Courier Corporation.
- [McKelvey and Palfrey1995] McKelvey, R. D., and Palfrey, T. R. 1995. Quantal response equilibria for normal form games. Games and economic behavior 10(1):6–38.
- [Mertikopoulos and Sandholm2016] Mertikopoulos, P., and Sandholm, W. H. 2016. Learning in games via reinforcement and regularization. Mathematics of Operations Research 41(4):1297–1324.
- [Paszke et al.2017] Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; and Lerer, A. 2017. Automatic differentiation in pytorch.
- [Thurstone1927] Thurstone, L. L. 1927. A law of comparative judgment. Psychological review 34(4):273.
- [Train2009] Train, K. E. 2009. Discrete choice methods with simulation. Cambridge university press.
- [Tversky1972] Tversky, A. 1972. Elimination by aspects: A theory of choice. Psychological review 79(4):281.
- [Von Stengel1996] Von Stengel, B. 1996. Efficient computation of behavior strategies. Games and Economic Behavior 14(2):220–246.
- [Vorobeychik, Wellman, and Singh2007] Vorobeychik, Y.; Wellman, M. P.; and Singh, S. 2007. Learning payoff functions in infinite games. Mach Learn 67:145–168.
- [Waugh, Ziebart, and Bagnell2011] Waugh, K.; Ziebart, B. D.; and Bagnell, J. A. 2011. Computational rationalization: the inverse equilibrium problem. In Proceedings of the 28th International Conference on International Conference on Machine Learning, 1169–1176. Omnipress.

## Appendix

### Best response of the NLQRE as a nested logit

Here we show that the best response in the NLQRE, which may contain parallel information sets (either due to chance or actions by other players), may be regarded as a nested logit. The idea is to express each strategy in the reduced normal form as a sequence of decisions, each describing what is to be done in each parallel information set.

Consider the following game in Figure 7. Chance (or the other player), labelled as (2), first chooses out of actions, which is made known to the player. For example, this could be the private cards which are dealt to a player in a game of poker. The nodes and are the subtrees following these actions by the chance player.

Without loss of generality, let and be the tree representation of the player’s strategies in subgame and , i.e., and are decision trees for the player. Given that chance could have chosen either action to begin with, the pure strategies are the cross product of all strategies between and , i.e., the player has to account for all possible contingencies. This may be written as a 2-stage decision process in Figure 8, where the first and second stages are choices from and respectively, where is duplicated times, where is the number of possible leaves for .

The rewards are additive in each stage, implying that the best responses for each of the duplicated trees are identical. (Note however that the actual payoffs are modulated by the probability of the chance player choosing the left or right action, but this factor is identical for each copy of .) Furthermore, the leaves of form a probability vector (since is a decision tree). This implies that setting the rationality parameters for the roots of all copies to be and the rationality parameter at to be yields precisely the one-player version of the optimization problem in (2), since the objective in each copy of is identical and their coefficients sum to . Recursively applying this process bottom up to each subtree (i.e., making duplicate copies of subtrees whenever we encounter parallel information sets) gives the desired result.

Observe that each pure strategy (path) in Figure 8 is a pure strategy in reduced normal form. However, each path may pass through different information sets (for example, when there is nesting of actions in the bus example), and hence different parameters. This is in line with what one would expect with nested logits.
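The duplication argument can be checked numerically on a toy instance. The sketch below is only illustrative: the payoffs `u` and `v`, the parameters `lam1` and `lam2`, and all function names are assumptions rather than values or notation from the paper, and the chance-probability weighting is ignored. It evaluates a two-stage nested logit in which every first-stage leaf leads to an identical copy of the second tree, and compares the resulting path probabilities with the product of two independent logit choices, one per parallel information set.

```python
import math

def softmax(scores, lam):
    """Logit choice probabilities with rationality parameter lam."""
    scaled = [s / lam for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def inclusive_value(scores, lam):
    """lam * log(sum(exp(score / lam))), computed stably."""
    scaled = [s / lam for s in scores]
    m = max(scaled)
    return lam * (m + math.log(sum(math.exp(s - m) for s in scaled)))

def nested_logit_two_stage(u, v, lam1, lam2):
    """Path probabilities when stage 1 picks a leaf of the first tree
    (payoffs u) and stage 2 picks a leaf in a copy of the second tree
    (payoffs v). Rewards are additive across the two stages."""
    iv = inclusive_value(v, lam2)                 # identical for every copy
    stage1 = softmax([ua + iv for ua in u], lam1)
    stage2 = softmax(v, lam2)                     # identical in each copy
    return [[pa * pb for pb in stage2] for pa in stage1]

# Hypothetical payoffs and rationality parameters.
u = [1.0, 0.0, -0.5]
v = [0.3, 0.8]
joint = nested_logit_two_stage(u, v, lam1=0.5, lam2=2.0)
independent = [[pa * pb for pb in softmax(v, 2.0)] for pa in softmax(u, 0.5)]
```

Because the inclusive value of the second stage is the same constant for every first-stage leaf, it cancels in the first-stage logit, so `joint` and `independent` agree up to floating-point error.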

### Fast forward and backward pass solvers

In this section, we provide the complete computational details and proofs on how to compute best responses for the forward and backward passes.

#### Forward Pass

For this section, the when referring to Algorithm 2. Setting to be the entropy terms in (2) and gives the expression in the form of (9). The natural divergence to be chosen is the standard entropy divergence adapted to the dilated setting (dropping terms in which do not contain ).

where a similar expression holds for . Plugging into the expression for gives

It is known that may be solved by a single bottom-up traversal of the game tree and a single sparse matrix-vector multiplication [Hoda et al.2010]. At each information set , we solve for the ‘behavioral’ best response (i.e., as if that information set were the root). Each of these sub-problems may be conveniently expressed using log-sum-exp and softmax functions. Denoting , we compute

where the constraint that is implicit from the log barrier.

Optimization of the inner summation, along with the relevant part of the inner product, may be done in closed form using log-sum-exp. The tree structure of the constraints for allows the traversal to be performed bottom-up. Throughout the traversal, denote as the ‘value’ of each information set and as the value of each action.

The behavioral strategies may be expressed using the softmax function. For an action belonging to info set ,

(14) |

The sequence form may be recovered from behavioral strategies using a single downwards traversal of the tree.
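The two traversals described above can be sketched for a single player as follows. This is a minimal illustration under stated assumptions, not the paper's solver: the dictionary layout, the field names `lam`, `payoff` and `children`, and the toy tree are all hypothetical, and `payoff` stands in for the linear coefficient attached to each action. The bottom-up pass computes a log-sum-exp value per information set and a softmax behavioral strategy per action; the top-down pass multiplies behavioral probabilities along each path to recover the sequence form.

```python
import math

def lse(xs):
    """Numerically stable log-sum-exp."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def solve_tree(infosets, actions, roots):
    value, q, behavior = {}, {}, {}

    def visit(i):
        # Bottom-up: action value = own payoff + values of child info sets.
        lam = infosets[i]["lam"]
        for a in infosets[i]["actions"]:
            q[a] = actions[a]["payoff"] + sum(visit(j) for j in actions[a]["children"])
        scaled = [q[a] / lam for a in infosets[i]["actions"]]
        norm = lse(scaled)
        value[i] = lam * norm                     # info set value
        for a, s in zip(infosets[i]["actions"], scaled):
            behavior[a] = math.exp(s - norm)      # softmax probability
        return value[i]

    for r in roots:
        visit(r)

    seq = {}

    def push(i, parent_mass):
        # Top-down: sequence-form weight = parent weight * behavioral probability.
        for a in infosets[i]["actions"]:
            seq[a] = parent_mass * behavior[a]
            for j in actions[a]["children"]:
                push(j, seq[a])

    for r in roots:
        push(r, 1.0)
    return value, behavior, seq

# Hypothetical two-level tree: action "a" at I0 leads to info set I1.
infosets = {
    "I0": {"lam": 1.0, "actions": ["a", "b"]},
    "I1": {"lam": 0.5, "actions": ["c", "d"]},
}
actions = {
    "a": {"payoff": 0.0, "children": ["I1"]},
    "b": {"payoff": 1.0, "children": []},
    "c": {"payoff": 2.0, "children": []},
    "d": {"payoff": 0.0, "children": []},
}
value, behavior, seq = solve_tree(infosets, actions, ["I0"])
```

Each information set and action is visited once per pass, so the cost is linear in the size of the tree. The recursion is used for brevity; a deep tree would call for an explicit stack.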

#### Backward Pass

The backward pass also requires solving a linear system to obtain . By rewriting the linear system as another min-max problem, we may again apply Algorithm 2. Observe that the equations of this system are precisely the (necessary and sufficient) KKT conditions of the following min-max problem

subject to | (15) |

Note that and are constants in the backward pass; here we are optimizing over , which are not probabilities. Since and are positive definite, this is of the form required by Algorithm 2. We select the natural distance generating function which yields (ignoring terms containing only ),

Plugging this into the expression for and rearranging gives

(16) |
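The idea of trading a linear solve for a saddle-point problem can be illustrated on a toy scalar instance. The sketch below is an assumption-laden stand-in: it uses an unconstrained quadratic and plain gradient descent-ascent rather than the constrained problem (15) and the primal-dual method of Algorithm 2, and the coefficients `a`, `b`, `c`, `p`, `q` are hypothetical. The objective is f(u, v) = a u^2 / 2 + b u v - c v^2 / 2 - p u - q v with a, c > 0, whose stationarity conditions are the linear system a u + b v = p, b u - c v = q.

```python
def solve_saddle(a, b, c, p, q, step=0.1, iters=2000):
    """Gradient descent on u and ascent on v for
    f(u, v) = a*u^2/2 + b*u*v - c*v^2/2 - p*u - q*v, with a, c > 0."""
    u, v = 0.0, 0.0
    for _ in range(iters):
        grad_u = a * u + b * v - p      # df/du
        grad_v = b * u - c * v - q      # df/dv
        u, v = u - step * grad_u, v + step * grad_v
    return u, v

def solve_direct(a, b, c, p, q):
    """Cramer's rule on [[a, b], [b, -c]] @ [u, v] = [p, q]."""
    det = -a * c - b * b
    return (-c * p - b * q) / det, (a * q - b * p) / det

u_gda, v_gda = solve_saddle(2.0, 1.0, 3.0, 1.0, -1.0)
u_dir, v_dir = solve_direct(2.0, 1.0, 3.0, 1.0, -1.0)
```

Only gradients of the objective are needed, so no matrix is formed or factorized. The step size here is chosen by hand for this particular instance.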

##### Proposition 1

###### Proof.

To obtain efficient solutions for the best responses, we require the following handy results, which may all be verified by algebraic manipulation.

###### Lemma 2.

, where ’s are column vectors of size equal to number of actions, and contain ’s where is some descendant of information set , and 0 otherwise.

Taking transposes gives the following: is equal to , where the ’s are column vectors of length equal to , and have entries equal to if the index is an ancestor of .

###### Lemma 3.

, i.e. equal to a square matrix of size , with diagonal entries equal to the corresponding to a given information state’s parent action.

###### Lemma 4.

For any vector , may be computed in linear time by traversing the tree bottom-up.

We are now ready to prove the main theorem. Letting in (16) gives

(17) |

which has KKT conditions

(18) | ||||

(19) |

Multiplying by gives

(20) |

Note that doing so introduces no new roots, since these are linear systems with a unique solution both before and after the multiplication. Applying Lemma 2 and Lemma 3 gives an expression for . Lemma 4, together with the fact that is diagonal, implies that may be solved for in linear time. The computation of requires , which may be done in time linear in the size of the extensive-form game tree. In the extreme case, the game could be a single-stage simultaneous-move game, resulting in being dense. However, for typical EFGs, should be fairly sparse.

With , we may solve for

(21) |

Since is tree-structured, the inversion may be done in linear time using Gaussian elimination. Similarly, may be computed in linear time because of sparsity in . That is, the number of non-zero elements in is equal to the sum of the number of actions over all information sets (recall that each row in has non-zero entries for the actions of a given information set and its parent). ∎
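The sparsity pattern described in the proof can be made concrete with a small sketch. It assumes that the matrix in question has the standard sequence-form constraint structure, with one row per information set holding -1 for the parent sequence and +1 for each of its actions, plus one row fixing the empty sequence to 1. The names `build_rows` and `residual` and the toy tree are hypothetical. Rows are stored as sparse dictionaries, so a matrix-vector product touches each non-zero entry exactly once.

```python
def build_rows(infosets):
    """infosets: list of (parent_sequence, [action_sequences]).
    The parent "" denotes the empty sequence."""
    rows = [({"": 1.0}, 1.0)]            # empty sequence carries mass 1
    for parent, acts in infosets:
        row = {parent: -1.0}             # mass flowing in from the parent
        for a in acts:
            row[a] = 1.0                 # mass split among the actions
        rows.append((row, 0.0))
    return rows

def residual(rows, x):
    """Sparse product minus right-hand side, linear in the non-zeros."""
    return [sum(coef * x[s] for s, coef in row.items()) - rhs
            for row, rhs in rows]

# Hypothetical tree: the root info set has actions a, b; a leads to c, d.
infosets = [("", ["a", "b"]), ("a", ["c", "d"])]
rows = build_rows(infosets)
x = {"": 1.0, "a": 0.6, "b": 0.4, "c": 0.45, "d": 0.15}
res = residual(rows, x)
nnz = sum(len(row) for row, _ in rows)
```

A valid sequence-form vector such as `x` gives a zero residual, and `nnz` grows with the total number of actions plus the number of information sets rather than with the square of the number of sequences.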
