From ccf3cbba55b30241a32f06edead25f4a99973c3c Mon Sep 17 00:00:00 2001
From: Thibaut Horel
Date: Tue, 28 Jun 2022 14:09:05 -0400
Subject: AOAS 2103-005, second revision

---
 Makefile             |   2 +-
 aoas-2013-005.tex    | 158 ---------------------------------------------------
 aoas-2103-005-R1.tex |  62 ++++++++++++++++++++
 aoas-2103-005.tex    | 158 +++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 221 insertions(+), 159 deletions(-)
 delete mode 100644 aoas-2013-005.tex
 create mode 100644 aoas-2103-005-R1.tex
 create mode 100644 aoas-2103-005.tex

diff --git a/Makefile b/Makefile
index 5a43db1..9247bfd 100644
--- a/Makefile
+++ b/Makefile
@@ -4,7 +4,7 @@ BIB = refs.bib
 
 .PHONY: all clean FORCE
 
-all: siopt-2021-140246.pdf
+all: aoas-2103-005-R2.pdf
 
 %.pdf: FORCE
 	latexrun -W no-xcolor $*.tex
diff --git a/aoas-2013-005.tex b/aoas-2013-005.tex
deleted file mode 100644
index e9c3922..0000000
--- a/aoas-2013-005.tex
+++ /dev/null
@@ -1,158 +0,0 @@
-\documentclass[10pt]{article}
-\usepackage[T1]{fontenc}
-\usepackage[utf8]{inputenc}
-\usepackage[hmargin=1in, vmargin=1in]{geometry}
-\usepackage{amsmath,amsfonts}
-
-
-\title{\vspace{-2em}\large Review: \emph{A Multi-Agent Reinforcement Learning Framework
-for Treatment Effects Evaluation in Two-Sided Markets}}
-\author{Submission 2013-005 to the \emph{Journal of Applied Statistics}}
-\date{}
-
-\begin{document}
-
-\maketitle
-
-\paragraph{Summary.}
-\looseness=-1
-This paper considers the problem of \emph{off-policy} learning in multi-agent
-reinforcement learning. The model (Section 2) considers $N$ agents/units
-evolving according to a Markov decision process: at each time step $t$, each
-agent is assigned a treatment/action in $\{0,1\}$, resulting in a vector of
-rewards $R_t\in\mathbb{R}^N$ (one for each agent). An underlying state in state
-space $\mathbb{S}$ evolves according to a Markov transition kernel
-$\mathcal{P}: \mathbb{S}\times\{0,1\}^N \to \Delta(\mathbb{S})$: given the
-state $s_t$ at time step $t$ and the vector of treatments $a_t\in\{0,1\}^N$,
-$\mathcal{P}(s_t, a_t)$ specifies the probability distribution of the state at
-time step $t+1$. Finally, a stationary policy $\pi:\mathbb{S}\to\{0,1\}^N$
-chooses a vector of treatments given an observed state.
-
-The goal of this paper is to design estimators of the expected average reward
-when choosing treatments according to a given policy $\pi$ over $T$ time
-steps; the difficulty is that, in the observed data, the treatments might
-differ from the ones that would have been chosen under the policy $\pi$. The
-authors start from a simple importance-sampling-based estimator, which suffers
-from prohibitively large variance due to the exponential size of the action
-space $\{0,1\}^N$. To address this problem, they introduce a mean-field
-approximation in which the dependency of the reward $r_i$ of agent $i$ on the
-treatments and states of the other agents is reduced to a scalar summary, thus
-significantly reducing the dimensionality of the problem and resulting in
-their first estimator $\hat{V}^{\rm IS}$ (Sections 3.1 and 3.3). This
-estimator is then combined with a standard $Q$-learning-based estimator in
-a manner known as \emph{doubly robust} estimation, resulting in their final
-estimator $\hat{V}^{\rm DR}$. This way of combining estimators in the context
-of RL is sometimes known as double reinforcement learning. The estimator is
-stated to be consistent and approximately normal in the appendix, with proofs
-supplied in the supplementary material.
-Finally, the estimator is evaluated experimentally in the context of
-ride-sharing platforms, first on synthetic data in Section 4 and then on real
-data in Section 5.
-
-\paragraph{Scope and contributions.} A major concern I have with this paper is
-the way it is currently framed, which makes it particularly difficult to
-appreciate its contributions. Specifically:
-\begin{enumerate}
-    \item The title and abstract mention \emph{two-sided markets}, but nothing
-        in the formulation is specific to two-sided markets, since the problem
-        is modeled at the level of a spatial unit (a geographical region in
-        the example of ride-sharing platforms) in which a single state
-        variable abstracts away all the details of both sides of the market.
-        When I first read the paper, I expected to see both sides of the
-        market (consumers and service providers) modeled separately as two
-        coupled Markov decision processes. Instead, this paper deals with
-        a generic multi-agent reinforcement learning problem, and ride-sharing
-        platforms only appear in the evaluation (4 out of 28 pages in total).
-    \item The title and the language in Sections 1 and 2 use terms from the
-        causal inference literature, such as treatment effect and potential
-        outcomes. But once the ATE is defined as the difference between the
-        values of two policies, it becomes clear that the problem is exactly
-        that of \emph{off-policy evaluation} in reinforcement learning. Hence,
-        the paper has little to do with causal inference and is firmly
-        anchored in the reinforcement learning literature, building on recent
-        results in this area.
-    \item The model is described as a multi-agent one, but it could be
-        equivalently described with a single agent whose action space is
-        $\{0,1\}^N$ and whose reward is the sum of the agents' rewards. Hence
-        the problem is not so much about multiple agents as about dealing with
-        an exponentially large action space: \emph{this should be mentioned
-        prominently}. It is, however, true that the main assumption driving
-        all the results and methods, the \emph{mean-field approximation}, is
-        more naturally stated from the perspective of multiple agents whose
-        rewards only depend on a scalar summary of the actions and states of
-        their neighbors.
-\end{enumerate}
-
-Following the above observations, I believe a much more accurate title for the
-paper would be: \emph{Off-policy value estimation for multi-agent
-reinforcement learning in the mean-field regime}. With this framing, it also
-becomes easier to appreciate the main contribution of this paper: the
-introduction of a mean-field approximation to circumvent the high
-dimensionality of the action space.
-
-\paragraph{Major comments.} The understanding of the paper's scope described
-in the previous paragraph raises the following concerns:
-\begin{itemize}
-    \looseness=-1
-    \item given the importance of the mean-field approximation in this paper,
-        it is surprising that it is not discussed in more detail. Is it
-        possible to test from data the extent to which it holds? If so, how?
-        Can experimental evidence for its validity be provided on the data
-        used in the evaluation sections (4 and 5)?
-    \item related to the previous point: I did not find any discussion of how
-        to choose the mean-field functions $m_i$ in practice. The evaluation
-        sections do not seem to mention how these functions were chosen.
-    \item once the mean-field approximation is introduced, the problem is
-        effectively reduced to a low-dimensional reinforcement learning
-        problem, and the methodological contribution (and theoretical
-        analysis) seems to follow from an almost routine adaptation of
-        previous papers. If this is not the case, the paper should do a better
-        job of describing what is novel in the adaptation of these previous
-        methods.
-    \item the evaluation section mentions that a comparison is made with the
-        DR-NM method, but it does not appear anywhere in the plots reporting
-        the MSE (only the DR-NS method appears).
-    \item given that the $\hat{V}^{\rm DR}$ estimator crucially uses the
-        regularized policy iteration estimator from Farahmand et al. (2016)
-        and Liao et al. (2020) (by combining it with the $\hat{V}^{\rm IS}$
-        estimator), I believe this estimator \emph{by itself} should also be
-        used as a baseline in the evaluation.
-    \item the code and synthetic data used for the evaluation should be
-        provided.
-\end{itemize}
-
-\paragraph{Other comments.}
-\begin{itemize}
-    \item in the proof of Theorem 3 in the supplementary material, last line
-        of page 5, the union bound guarantees that the last inequality holds
-        with probability at least $1-O(N^{-1}T^{-2})$ and not
-        $1-O(N^{-1}T^{-1})$, if I am not mistaken. This does not change the
-        conclusion of the theorem.
-    \item can the (CMIA) assumption on page 6 be thought of as a kind of
-        Markovian assumption? If I am not mistaken, it is weaker than saying
-        that $R_{i,t}$ is independent of $(A_j, R_j, S_j)_{0\leq j
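
For concreteness, the two estimator families that the review's Summary refers
to can be sketched in their generic single-agent, finite-horizon, discounted
form. This is only an illustrative sketch of standard off-policy evaluation
notation, not the paper's mean-field, average-reward estimators: the behavior
policy $b$, the trajectory index $k$ (over $n$ observed trajectories), and the
discount factor $\gamma$ are introduced here for illustration, and both $\pi$
and $b$ are written as stochastic for generality even though the review's
$\pi$ is deterministic. The (per-decision) importance-sampling estimator is
\[
\hat{V}^{\rm IS} = \frac{1}{n}\sum_{k=1}^{n}\sum_{t=0}^{T-1}
  \gamma^{t}\,\rho^{(k)}_{0:t}\,R^{(k)}_{t},
\qquad
\rho^{(k)}_{0:t} = \prod_{t'=0}^{t}
  \frac{\pi\bigl(A^{(k)}_{t'}\mid S^{(k)}_{t'}\bigr)}
       {b\bigl(A^{(k)}_{t'}\mid S^{(k)}_{t'}\bigr)},
\]
and the doubly robust form corrects a $Q$-function plug-in with
importance-weighted residuals,
\[
\hat{V}^{\rm DR} = \frac{1}{n}\sum_{k=1}^{n}\biggl[\hat{V}\bigl(S^{(k)}_{0}\bigr)
  + \sum_{t=0}^{T-1}\gamma^{t}\,\rho^{(k)}_{0:t}
    \Bigl(R^{(k)}_{t} + \gamma\,\hat{V}\bigl(S^{(k)}_{t+1}\bigr)
    - \hat{Q}\bigl(S^{(k)}_{t},A^{(k)}_{t}\bigr)\Bigr)\biggr],
\qquad
\hat{V}(s)=\sum_{a}\pi(a\mid s)\,\hat{Q}(s,a),
\]
which keeps low bias whenever either the $Q$-function model or the behavior
policy model is accurate (the usual doubly robust property). With
$A_t\in\{0,1\}^N$, each ratio compares probabilities over a $2^N$-element
action space (a product of $N$ per-agent ratios when both policies factorize
across agents), which is the source of the prohibitive variance of
$\hat{V}^{\rm IS}$ noted in the Summary and the motivation for replacing the
other agents' treatments and states by a mean-field scalar summary.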