From ccf3cbba55b30241a32f06edead25f4a99973c3c Mon Sep 17 00:00:00 2001
From: Thibaut Horel
Date: Tue, 28 Jun 2022 14:09:05 -0400
Subject: AOAS 2103-005, second revision

---
 Makefile             |   2 +-
 aoas-2013-005.tex    | 158 ---------------------------------------------------
 aoas-2103-005-R1.tex |  62 ++++++++++++++++++++
 aoas-2103-005.tex    | 158 +++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 221 insertions(+), 159 deletions(-)
 delete mode 100644 aoas-2013-005.tex
 create mode 100644 aoas-2103-005-R1.tex
 create mode 100644 aoas-2103-005.tex

diff --git a/Makefile b/Makefile
index 5a43db1..9247bfd 100644
--- a/Makefile
+++ b/Makefile
@@ -4,7 +4,7 @@ BIB = refs.bib
 
 .PHONY: all clean FORCE
 
-all: siopt-2021-140246.pdf
+all: aoas-2103-005-R2.pdf
 
 %.pdf: FORCE
 	latexrun -W no-xcolor $*.tex
diff --git a/aoas-2013-005.tex b/aoas-2013-005.tex
deleted file mode 100644
index e9c3922..0000000
--- a/aoas-2013-005.tex
+++ /dev/null
@@ -1,158 +0,0 @@
-\documentclass[10pt]{article}
-\usepackage[T1]{fontenc}
-\usepackage[utf8]{inputenc}
-\usepackage[hmargin=1in, vmargin=1in]{geometry}
-\usepackage{amsmath,amsfonts}
-
-
-\title{\vspace{-2em}\large Review: \emph{A Multi-Agent Reinforcement Learning Framework
-for Treatment Effects Evaluation in Two-Sided Markets}}
-\author{Submission 2013-005 to the \emph{Journal of Applied Statistics}}
-\date{}
-
-\begin{document}
-
-\maketitle
-
-\paragraph{Summary.}
-\looseness=-1
-This paper considers the problem of \emph{off-policy} learning in multi-agent
-reinforcement learning. The model (Section 2) considers $N$ agents/units
-evolving according to a Markov decision process: at each time step $t$, each
-agent is assigned a treatment/action in $\{0,1\}$, resulting in a vector of
-rewards $R_t\in\mathbb{R}^N$ (one for each agent). An underlying state in state
-space $\mathbb{S}$ evolves according to a Markov transition kernel
-$\mathcal{P}: \mathbb{S}\times\{0,1\}^N \to \Delta(\mathbb{S})$: given the
-state $s_t$ at time step $t$ and the vector of treatments $a_t\in\{0,1\}^N$,
-$\mathcal{P}(s_t, a_t)$ specifies the probability distribution of the state at
-time step $t+1$. Finally, a stationary policy $\pi:\mathbb{S}\to\{0,1\}^N$
-chooses a vector of treatments given an observed state.
-
-The goal of this paper is to design estimators of the expected average reward
-when choosing treatments according to a given policy $\pi$ over $T$ time
-steps; the difficulty is that, in the observed data, the treatments might
-differ from the ones that would have been chosen under the policy $\pi$. The
-authors start from a simple importance-sampling-based estimator, which suffers
-from prohibitively large variance due to the exponential size of the action
-space $\{0,1\}^N$. To address this problem, they introduce a mean-field
-approximation in which the dependency of the reward $r_i$ of agent $i$ on the
-treatments and states of the other agents is reduced to a scalar summary, thus
-significantly reducing the dimensionality of the problem and resulting in
-their first estimator $\hat{V}^{\rm IS}$ (Sections 3.1 and 3.3). This
-estimator is then combined with a standard $Q$-learning-based estimator in
-a manner known as \emph{doubly robust} estimation, resulting in their final
-estimator $\hat{V}^{\rm DR}$. This way of combining estimators in the context
-of RL is sometimes known as double reinforcement learning. The estimator is
-stated to be consistent and approximately normal in the appendix, with proofs
-supplied in the supplementary material.
-Finally, the estimator is evaluated experimentally in the context of
-ride-sharing platforms, first on synthetic data in Section 4 and then on real
-data in Section 5.
-
-\paragraph{Scope and contributions.} A major concern I have with this paper is
-the way it is currently framed, which makes it particularly difficult to
-appreciate its contributions. Specifically:
-\begin{enumerate}
-    \item The title and abstract mention \emph{two-sided markets}, but nothing
-        in the formulation is specific to two-sided markets, since the problem
-        is modeled at the level of a spatial unit (a geographical region in
-        the example of ride-sharing platforms) in which a single state
-        variable abstracts away all the details of both sides of the market.
-        When I first read the paper, I expected to see both sides of the
-        market (consumers and service providers) modeled separately as two
-        coupled Markov decision processes. Instead, this paper deals with
-        a generic multi-agent reinforcement learning problem, and ride-sharing
-        platforms only appear in the evaluation (4 out of 28 pages in total).
-    \item The title and the language in Sections 1 and 2 use terms from the
-        causal inference literature, such as treatment effect and potential
-        outcomes. But once the ATE is defined as the difference between the
-        values of two policies, it becomes clear that the problem is exactly
-        that of \emph{off-policy evaluation} in reinforcement learning. Hence,
-        the paper has little to do with causal inference and is firmly
-        anchored in the reinforcement learning literature, building on recent
-        results in this area.
-    \item The model is described as a multi-agent one, but it could be
-        equivalently described with a single agent whose action space is
-        $\{0,1\}^N$ and whose reward is the sum of the agents' rewards. Hence
-        the problem is not so much about multiple agents as about dealing with
-        an exponentially large action space: \emph{this should be mentioned
-        prominently}. It is, however, true that the main assumption driving
-        all the results and methods, the \emph{mean-field approximation}, is
-        more naturally stated from the perspective of multiple agents whose
-        rewards only depend on a scalar summary of the actions and states of
-        their neighbors.
-\end{enumerate}
-
-Following the above observations, I believe a much more accurate title for the
-paper would be: \emph{Off-policy value estimation for multi-agent
-reinforcement learning in the mean-field regime}. With this framing, it also
-becomes easier to appreciate the main contribution of this paper: the
-introduction of a mean-field approximation to circumvent the high
-dimensionality of the action space.
-
-\paragraph{Major comments.} The understanding of the paper's scope described
-in the previous paragraph raises the following concerns:
-\begin{itemize}
-    \looseness=-1
-    \item given the importance of the mean-field approximation in this paper,
-        it is surprising that it is not discussed in more detail. Is it
-        possible to test from data the extent to which it holds? If so, how?
-        Can experimental evidence for its validity be provided on the data
-        used in the evaluation sections (4 and 5)?
-    \item related to the previous point: I did not find any discussion of how
-        to choose the mean-field functions $m_i$ in practice. The evaluation
-        sections do not seem to mention how these functions were chosen.
-    \item once the mean-field approximation is introduced, the problem is
-        effectively reduced to a low-dimensional reinforcement learning
-        problem, and the methodological contribution (and theoretical
-        analysis) seems to follow from an almost routine adaptation of
-        previous papers. If this is not the case, the paper should do a better
-        job of describing what is novel in the adaptation of these previous
-        methods.
-    \item the evaluation section mentions that a comparison is made with the
-        DR-NM method, but it does not appear anywhere in the plots reporting
-        the MSE (only the DR-NS method appears).
-    \item given that the $\hat{V}^{\rm DR}$ estimator crucially uses the
-        regularized policy iteration estimator from Farahmand et al. (2016)
-        and Liao et al. (2020) (by combining it with the $\hat{V}^{\rm IS}$
-        estimator), I believe this estimator \emph{by itself} should also be
-        used as a baseline in the evaluation.
-    \item the code and synthetic data used for the evaluation should be
-        provided.
-\end{itemize}
-
-\paragraph{Other comments.}
-\begin{itemize}
-    \item in the proof of Theorem 3 in the supplementary material, last line
-        of page 5, the union bound guarantees that the last inequality holds
-        with probability at least $1-O(N^{-1}T^{-2})$ and not
-        $1-O(N^{-1}T^{-1})$, if I am not mistaken. This does not change the
-        conclusion of the theorem.
-    \item can the (CMIA) assumption on page 6 be thought of as a kind of
-        Markovian assumption? If I am not mistaken, it is weaker than saying
-        that $R_{i,t}$ is independent of $(A_j, R_j, S_j)_{0\leq j
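
For concreteness, the two estimator families that the review's Summary refers
to can be sketched in their generic single-agent, finite-horizon, discounted
form. This is only an illustrative sketch of standard off-policy evaluation
notation, not the paper's mean-field, average-reward estimators: the behavior
policy $b$, the trajectory index $k$ (over $n$ observed trajectories), and the
discount factor $\gamma$ are introduced here for illustration, and both $\pi$
and $b$ are written as stochastic for generality even though the review's
$\pi$ is deterministic. The (per-decision) importance-sampling estimator is
\[
\hat{V}^{\rm IS} = \frac{1}{n}\sum_{k=1}^{n}\sum_{t=0}^{T-1}
  \gamma^{t}\,\rho^{(k)}_{0:t}\,R^{(k)}_{t},
\qquad
\rho^{(k)}_{0:t} = \prod_{t'=0}^{t}
  \frac{\pi\bigl(A^{(k)}_{t'}\mid S^{(k)}_{t'}\bigr)}
       {b\bigl(A^{(k)}_{t'}\mid S^{(k)}_{t'}\bigr)},
\]
and the doubly robust form corrects a $Q$-function plug-in with
importance-weighted residuals,
\[
\hat{V}^{\rm DR} = \frac{1}{n}\sum_{k=1}^{n}\biggl[\hat{V}\bigl(S^{(k)}_{0}\bigr)
  + \sum_{t=0}^{T-1}\gamma^{t}\,\rho^{(k)}_{0:t}
    \Bigl(R^{(k)}_{t} + \gamma\,\hat{V}\bigl(S^{(k)}_{t+1}\bigr)
    - \hat{Q}\bigl(S^{(k)}_{t},A^{(k)}_{t}\bigr)\Bigr)\biggr],
\qquad
\hat{V}(s)=\sum_{a}\pi(a\mid s)\,\hat{Q}(s,a),
\]
which keeps low bias whenever either the $Q$-function model or the behavior
policy model is accurate (the usual doubly robust property). With
$A_t\in\{0,1\}^N$, each ratio compares probabilities over a $2^N$-element
action space (a product of $N$ per-agent ratios when both policies factorize
across agents), which is the source of the prohibitive variance of
$\hat{V}^{\rm IS}$ noted in the Summary and the motivation for replacing the
other agents' treatments and states by a mean-field scalar summary.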