\documentclass[10pt]{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage[hmargin=1in, vmargin=1in]{geometry}
\usepackage{amsmath,amsfonts}
\title{\vspace{-2em}\large Review: \emph{A Multi-Agent Reinforcement Learning Framework for Treatment Effects Evaluation in Two-Sided Markets}}
\author{Submission 2013-005 to the \emph{Journal of Applied Statistics}}
\date{}
\begin{document}
\maketitle

\paragraph{Summary.} \looseness=-1 This paper considers the problem of \emph{off-policy} evaluation in multi-agent reinforcement learning.
The model (Section 2) considers $N$ agents/units evolving according to a Markov decision process: at each time step $t$, each agent is assigned a treatment/action in $\{0,1\}$, resulting in a vector of rewards $R_t\in\mathbb{R}^N$ (one for each agent).
An underlying state in state space $\mathbb{S}$ evolves according to a Markov transition kernel $\mathcal{P}: \mathbb{S}\times\{0,1\}^N \to \Delta(\mathbb{S})$: given the state $s_t$ at time step $t$ and the vector of treatments $a_t\in\{0,1\}^N$, $\mathcal{P}(s_t, a_t)$ specifies the probability distribution of the state at time step $t+1$.
Finally, a stationary policy $\pi:\mathbb{S}\to\{0,1\}^N$ chooses a vector of treatments given an observed state.
The goal of the paper is to design estimators for the expected average reward when treatments are chosen according to a given policy $\pi$ over $T$ time steps; the difficulty is that, in the observed data, the treatments may differ from those that would have been chosen under $\pi$.
The authors start from a simple importance-sampling estimator, which suffers from prohibitively large variance due to the exponential size of the action space $\{0,1\}^N$.
To address this problem, they introduce a mean-field approximation in which the dependency of the reward $r_i$ of agent $i$ on the treatments and states of the other agents is reduced to a scalar summary, significantly reducing the dimensionality of the problem and resulting in their first estimator $\hat{V}^{\rm IS}$ (Sections 3.1 and 3.3).
This estimator is then combined with a standard $Q$-learning based estimator in the manner known as \emph{doubly robust} estimation, resulting in their final estimator $\hat{V}^{\rm DR}$; this way of combining estimators in the context of RL is sometimes known as double reinforcement learning.
The estimator is stated to be consistent and asymptotically normal in the appendix, with proofs supplied in the supplementary material.
Finally, the estimator is evaluated experimentally in the context of ride-sharing platforms, first on synthetic data in Section 4 and then on real data in Section 5.
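For concreteness, and because the source of the variance problem matters for several comments below, let me sketch my understanding of the plain importance-sampling estimator; the exact weighting and normalization used in the paper may differ, so this should be read as a rough sketch only.
Writing $b(a\mid s)$ for the behavior policy that generated the data and $\pi(a\mid s)$ for the (possibly degenerate) probability that the target policy selects the treatment vector $a$ in state $s$, the per-step importance ratio is
\[
\rho_t \;=\; \frac{\pi(a_t\mid s_t)}{b(a_t\mid s_t)}
\;=\; \prod_{i=1}^{N} \frac{\pi_i(a_{i,t}\mid s_t)}{b_i(a_{i,t}\mid s_t)}
\quad \text{(when both policies factorize over agents),}
\]
so that even in the factorized case each $\rho_t$ is a product of $N$ per-agent ratios, and the variance of the resulting estimator grows exponentially with $N$.
As I understand it, the mean-field approximation is precisely what removes this dependence: the reward of agent $i$ is allowed to depend only on its own state, its own treatment, and a scalar summary of its neighbors, so that only low-dimensional ratios enter $\hat{V}^{\rm IS}$.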
\paragraph{Scope and contributions.} A major concern I have regarding this paper is the way it is currently framed, which makes it particularly difficult to appreciate its contributions. Specifically:
\begin{enumerate}
\item The title and abstract mention \emph{two-sided markets}, but nothing in the formulation is specific to two-sided markets, since the problem is modeled at the level of a spatial unit (a geographical region in the example of ride-sharing platforms) in which a single state variable abstracts away all the details of both sides of the market. When I first read the paper, I expected to see both sides of the market (consumers and service providers) modeled separately as two coupled Markov decision processes. Instead, the paper deals with a generic multi-agent reinforcement learning problem, and ride-sharing platforms only appear in the evaluation (4 of the 28 pages in total).
\item The title and the language of Sections 1 and 2 use terms from the causal inference literature, such as treatment effect and potential outcomes. But once the ATE is defined as the difference between the values of two policies, it becomes clear that the problem is exactly that of \emph{off-policy evaluation} in reinforcement learning. Hence, the paper has little to do with causal inference and is firmly anchored in the reinforcement learning literature, building on recent results in this area.
\item The model is described as a multi-agent one, but it could equivalently be described with a single agent whose action space is $\{0,1\}^N$ and whose reward is the sum of the agents' rewards. Hence the problem is not so much about multiple agents as about dealing with an exponentially large action space: \emph{this should be mentioned prominently}. It is, however, true that the main assumption driving all the results and methods, the \emph{mean-field approximation}, is more naturally stated from the perspective of multiple agents whose rewards depend only on a scalar summary of the actions and states of their neighbors.
\end{enumerate}
Following the above observations, I believe a much more accurate title for the paper would be: \emph{Off-policy value estimation for multi-agent reinforcement learning in the mean-field regime}. It also becomes easier to appreciate the main contribution of the paper: the introduction of a mean-field approximation to circumvent the high dimensionality of the action space.

\paragraph{Major comments.} The reading of the paper's scope laid out in the previous paragraph raises the following concerns:
\begin{itemize} \looseness=-1
\item given the importance of the mean-field approximation in this paper, it is surprising that it is not discussed more. Is it possible to test from data the extent to which it holds? If so, how? Can experimental evidence for its validity be provided for the data used in the evaluation sections (Sections 4 and 5)?
\item related to the previous point: I did not find any discussion of how to choose the mean-field functions $m_i$ in practice, and the evaluation sections do not seem to mention how these functions were chosen (a concrete example of the kind of choice I have in mind is sketched just after this list).
\item once the mean-field approximation is introduced, the problem is effectively reduced to a low-dimensional reinforcement learning problem, and the methodological contribution (and the theoretical analysis) seems to follow from an almost routine adaptation of previous papers. If this is not the case, the paper should do a better job of describing what is novel in the adaptation of these previous methods.
\item the evaluation section mentions that a comparison is made with the DR-NM method, but it does not appear in any of the plots reporting the MSE (only the DR-NS method appears).
\item given that the $\hat V^{\rm DR}$ estimator crucially uses the regularized policy iteration estimator of Farahmand et al. (2016) and Liao et al. (2020) (by combining it with the $\hat{V}^{\rm IS}$ estimator), I believe this estimator \emph{by itself} should also be included as a baseline in the evaluation.
\item the code and synthetic data used for the evaluation should be provided.
\end{itemize}
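To make the first two points above concrete, here is the kind of choice and check I would have expected to see discussed; this is my own illustration, not something taken from the paper. With $\mathcal{N}_i$ denoting the set of spatial neighbors of unit $i$, a natural scalar summary is the fraction of treated neighbors,
\[
m_i(a_t) \;=\; \frac{1}{|\mathcal{N}_i|} \sum_{j\in\mathcal{N}_i} a_{j,t},
\]
possibly complemented by an analogous average of the neighbors' states. The validity of the mean-field approximation could then be probed empirically by checking whether adding further functions of the neighbors' treatments and states (beyond this scalar summary) to the reward model changes the fit appreciably.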
\paragraph{Other comments.}
\begin{itemize}
\item in the proof of Theorem 3 in the supplementary material, last line of page 5, the union bound guarantees that the last inequality holds with probability at least $1-O(N^{-1} T^{-2})$ and not $1-O(N^{-1}T^{-1})$, if I am not mistaken. This does not change the conclusion of the theorem.
\item can the (CMIA) assumption on page 6 be thought of as a kind of Markovian assumption? If I am not mistaken, it is weaker than saying that $R_{i,t}$ is independent of $(A_j, R_j, S_j)_{0\leq j