\documentclass[10pt]{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage[hmargin=1in, vmargin=1in]{geometry}
\usepackage{amsmath,amsfonts}
\title{\vspace{-2em}\large Review: \emph{A Multi-Agent Reinforcement Learning Framework for Treatment Effects Evaluation in Two-Sided Markets}}
\author{Submission 2013-005 to the \emph{Journal of Applied Statistics}}
\date{}
\begin{document}
\maketitle

\paragraph{Summary.} \looseness=-1 This paper considers the problem of \emph{off-policy} evaluation in multi-agent reinforcement learning.
The model (Section 2) considers $N$ agents/units evolving according to a Markov decision process: at each time step $t$, each agent is assigned a treatment/action in $\{0,1\}$, resulting in a vector of rewards $R_t\in\mathbb{R}^N$ (one for each agent).
An underlying state in state space $\mathbb{S}$ evolves according to a Markov transition kernel $\mathcal{P}: \mathbb{S}\times\{0,1\}^N \to \Delta(\mathbb{S})$: given the state $s_t$ at time step $t$ and the vector of treatments $a_t\in\{0,1\}^N$, $\mathcal{P}(s_t, a_t)$ specifies the probability distribution of the state at time step $t+1$.
Finally, a stationary policy $\pi:\mathbb{S}\to\{0,1\}^N$ chooses a vector of treatments given an observed state.
The goal of the paper is to design estimators for the expected average reward when treatments are chosen according to a given policy $\pi$ over $T$ time steps; the difficulty is that, in the observed data, the treatments may differ from those that would have been chosen under $\pi$.
The authors start from a simple importance-sampling estimator, which suffers from prohibitively large variance due to the exponential size of the action space $\{0,1\}^N$.
To address this problem, they introduce a mean-field approximation in which the dependency of the reward $r_i$ of agent $i$ on the treatments and states of the other agents is reduced to a scalar summary, significantly reducing the dimensionality of the problem and resulting in their first estimator $\hat{V}^{\rm IS}$ (Sections 3.1 and 3.3).
This estimator is then combined with a standard $Q$-learning based estimator in the manner known as \emph{doubly robust} estimation, resulting in their final estimator $\hat{V}^{\rm DR}$; this way of combining estimators in the context of RL is sometimes known as double reinforcement learning.
The estimator is stated to be consistent and asymptotically normal in the appendix, with proofs supplied in the supplementary material.
Finally, the estimator is evaluated experimentally in the context of ride-sharing platforms, first on synthetic data in Section 4 and then on real data in Section 5.
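For concreteness, and because the source of the variance problem matters for several comments below, let me sketch my understanding of the plain importance-sampling estimator; the exact weighting and normalization used in the paper may differ, so this should be read as a rough sketch only.
Writing $b(a\mid s)$ for the behavior policy that generated the data and $\pi(a\mid s)$ for the (possibly degenerate) probability that the target policy selects the treatment vector $a$ in state $s$, the per-step importance ratio is
\[
\rho_t \;=\; \frac{\pi(a_t\mid s_t)}{b(a_t\mid s_t)}
\;=\; \prod_{i=1}^{N} \frac{\pi_i(a_{i,t}\mid s_t)}{b_i(a_{i,t}\mid s_t)}
\quad \text{(when both policies factorize over agents),}
\]
so that even in the factorized case each $\rho_t$ is a product of $N$ per-agent ratios, and the variance of the resulting estimator grows exponentially with $N$.
As I understand it, the mean-field approximation is precisely what removes this dependence: the reward of agent $i$ is allowed to depend only on its own state, its own treatment, and a scalar summary of its neighbors, so that only low-dimensional ratios enter $\hat{V}^{\rm IS}$.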
\paragraph{Scope and contributions.} A major concern I have regarding this paper is the way it is currently framed, which makes it particularly difficult to appreciate its contributions. Specifically:
\begin{enumerate}
\item The title and abstract mention \emph{two-sided markets}, but nothing in the formulation is specific to two-sided markets, since the problem is modeled at the level of a spatial unit (a geographical region in the example of ride-sharing platforms) in which a single state variable abstracts away all the details of both sides of the market. When I first read the paper, I expected to see both sides of the market (consumers and service providers) modeled separately as two coupled Markov decision processes. Instead, the paper deals with a generic multi-agent reinforcement learning problem, and ride-sharing platforms only appear in the evaluation (4 of the 28 pages in total).
\item The title and the language of Sections 1 and 2 use terms from the causal inference literature, such as treatment effect and potential outcomes. But once the ATE is defined as the difference between the values of two policies, it becomes clear that the problem is exactly that of \emph{off-policy evaluation} in reinforcement learning. Hence, the paper has little to do with causal inference and is firmly anchored in the reinforcement learning literature, building on recent results in this area.
\item The model is described as a multi-agent one, but it could equivalently be described with a single agent whose action space is $\{0,1\}^N$ and whose reward is the sum of the agents' rewards. Hence the problem is not so much about multiple agents as about dealing with an exponentially large action space: \emph{this should be mentioned prominently}. It is, however, true that the main assumption driving all the results and methods, the \emph{mean-field approximation}, is more naturally stated from the perspective of multiple agents whose rewards depend only on a scalar summary of the actions and states of their neighbors.
\end{enumerate}
Following the above observations, I believe a much more accurate title for the paper would be: \emph{Off-policy value estimation for multi-agent reinforcement learning in the mean-field regime}. It also becomes easier to appreciate the main contribution of the paper: the introduction of a mean-field approximation to circumvent the high dimensionality of the action space.

\paragraph{Major comments.} The reading of the paper's scope laid out in the previous paragraph raises the following concerns:
\begin{itemize} \looseness=-1
\item given the importance of the mean-field approximation in this paper, it is surprising that it is not discussed more. Is it possible to test from data the extent to which it holds? If so, how? Can experimental evidence for its validity be provided for the data used in the evaluation sections (Sections 4 and 5)?
\item related to the previous point: I did not find any discussion of how to choose the mean-field functions $m_i$ in practice, and the evaluation sections do not seem to mention how these functions were chosen (a concrete example of the kind of choice I have in mind is sketched just after this list).
\item once the mean-field approximation is introduced, the problem is effectively reduced to a low-dimensional reinforcement learning problem, and the methodological contribution (and the theoretical analysis) seems to follow from an almost routine adaptation of previous papers. If this is not the case, the paper should do a better job of describing what is novel in the adaptation of these previous methods.
\item the evaluation section mentions that a comparison is made with the DR-NM method, but it does not appear in any of the plots reporting the MSE (only the DR-NS method appears).
\item given that the $\hat V^{\rm DR}$ estimator crucially uses the regularized policy iteration estimator of Farahmand et al. (2016) and Liao et al. (2020) (by combining it with the $\hat{V}^{\rm IS}$ estimator), I believe this estimator \emph{by itself} should also be included as a baseline in the evaluation.
\item the code and synthetic data used for the evaluation should be provided.
\end{itemize}
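To make the first two points above concrete, here is the kind of choice and check I would have expected to see discussed; this is my own illustration, not something taken from the paper. With $\mathcal{N}_i$ denoting the set of spatial neighbors of unit $i$, a natural scalar summary is the fraction of treated neighbors,
\[
m_i(a_t) \;=\; \frac{1}{|\mathcal{N}_i|} \sum_{j\in\mathcal{N}_i} a_{j,t},
\]
possibly complemented by an analogous average of the neighbors' states. The validity of the mean-field approximation could then be probed empirically by checking whether adding further functions of the neighbors' treatments and states (beyond this scalar summary) to the reward model changes the fit appreciably.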
\paragraph{Other comments.}
\begin{itemize}
\item in the proof of Theorem 3 in the supplementary material, last line of page 5, the union bound guarantees that the last inequality holds with probability at least $1-O(N^{-1} T^{-2})$ and not $1-O(N^{-1}T^{-1})$, if I am not mistaken. This does not change the conclusion of the theorem.
\item can the (CMIA) assumption on page 6 be thought of as a kind of Markovian assumption? If I am not mistaken, it is weaker than saying that $R_{i,t}$ is independent of $(A_j, R_j, S_j)_{0\leq j