1 files changed, 140 insertions, 0 deletions
diff --git a/jasa-2019-0653.tex b/jasa-2019-0653.tex
new file mode 100644
index 0000000..0536f51
--- /dev/null
+++ b/jasa-2019-0653.tex
@@ -0,0 +1,140 @@
+\documentclass[10pt]{article}
+\usepackage[T1]{fontenc}
+\usepackage[utf8]{inputenc}
+\usepackage[hmargin=1.2in, vmargin=1.2in]{geometry}
+\usepackage{amsmath,amsfonts}
+
+\title{\large Review of \emph{Real-time Regression Analysis of Streaming Clustered Data with Possible Abnormal Data Batches}}
+\date{}
+
+\begin{document}
+
+\maketitle
+
+\paragraph{Summary.} This paper studies the problem of estimating the
+parameters $\beta\in\mathbb{R}^p$ of a marginal generalized linear model with
+correlated outcomes via the ``quadratic inference function'' (QIF). Minimizing
+the QIF yields an estimator $\hat\beta^\star$ of $\beta$, the ``oracle''
+estimator. Since solving the minimizing problem requires processing all the
+available samples at once and could be too expensive, the authors instead focus
+on a streaming setting, where small clusters of data arrive in batch. The goal
+is to obtain a ``renewable'' estimator such that given the current estimate one
+can process a cluster from the stream and update the estimate on the fly, the
+crucial point being that the update does not require looking at the entire
+dataset and can be performed by only considering the current state and data
+coming from the stream. The contributions are:
+\begin{itemize}
+	\item by considering a linearization of the QIF, the authors obtain
+		a renewable estimator $\hat\beta$ for which the update can be performed
+		by solving an equation involving only the fresh data from the stream
+		and a summary of the past data (in the form of an aggregated gradient
+		and variance matrix). This equation can be solved using the
+		Newton--Raphson method.
+	\item a theoretical analysis of the resulting estimator, showing
+		consistency and asymptotic normality. Furthermore it is shown that in
+		the large data limit, $\hat\beta$ is equivalent to the oracle estimator
+		$\hat\beta^\star$.
+	\item in the presence of abnormal or corrupted data batches generated with
+		model parameters different from $\beta$, the authors propose
+		a goodness-of-fit test by comparing the current batch to the last
+		correct batch (the last one which passed the test) and only use data
+		batches passing the test in the renewable estimator.
+	\item a discussion about implementation of the proposed method on the Spark
+		Lambda architecture.
+	\item an evaluation of the proposed estimator on synthetic data (linear and
+		logistic models) and the monitoring procedure (test of abnormal
+		batches) confirming the theory.
+	\item an application to a real dataset of car crashes.
+\end{itemize}
+
+\paragraph{Major Comments.}
+
+The problem considered in this paper is important and well-motivated and in
+light of the ever-increasing amount of available data, the need for efficient
+estimators is self-evident. The proposed method appears sound and its benefits
+are convincingly demonstrated in the experiments. However, I found that the
+theoretical analysis suffers from some notable gaps as follows:
+\begin{itemize}
+	\item Appendix A.2 does not contain enough details for me to be able to
+		assess the correctness of the proof. While I believe the result to be
+		true (with maybe slightly stronger assumptions) the provided argument
+		does not constitute a proof in my opinion. In particular:
+		\begin{itemize}
+			\item the overall proof structure (by induction) does not seem
+				sound to me: the induction hypothesis is that $\tilde\beta_j$
+				is consistent for $j=1,\dots,b-1$. However, being consistent is
+				an asymptotic property as the amount data goes to infinity.
+				Yet, for a fix $j$, $\tilde\beta_j$ is computed from a fixed
+				and finite amount of data and thus cannot be consistent. Note
+				that $\tilde\beta_j$ could be consistent if $n_j$ goes to
+				infinity, but it is only assumed that $N_b=\sum_{j=1}^b
+				n_j\to\infty$. So in its current form, I don't see how the
+				proposed proof structure addresses the case where $n_j=n$ is
+				constant and only $b$ goes to infinity (which still implies
+				$N_b = nb\to_{b\to\infty}\infty$).
+			\item it is not clear to me why the quantity in equation (a.4) is
+				$O_p(N_b^{-1/2})$ (this relates to the induction hypothesis
+				being ill-defined as per the previous point).
+			\item the argument from lines 35--40 on page 37 seems to use the
+				mean value theorem (although the quantity $\xi_b$ appearing in
+				those lines has not been defined and should be defined).
+				However, the mean value theorem does not hold for vector valued
+				functions. Since ultimately, only an inequality is needed,
+				I suspect an easy workaround is to use the mean value inequality,
+				but as written, the current argument is not valid.
+			\item the proof relies on $\tilde\beta_j$ being in
+				$\mathbb{N}_\delta$ for all $j$ (page 38, line 19). I don't
+				think this is guaranteed by the procedure, is it an
+				additional assumption that the authors rely on? If so, it
+				should be added to the theorem statement (or maybe the
+				assumption on positive-definiteness should be relaxed to hold
+				everywhere).
+			\item page 38, line 21, positive-definiteness of
+				$C_j(\tilde\beta_j)$ is invoked. However, the assumption in the
+				theorem is only that it is positive-definite \emph{in
+				expectation}.  I am not sure how to bridge this gap (should the
+				assumption be strengthened in the theorem statement?)
+		\end{itemize}
+	\item the renewable method requires solving an equation using the
+		Newton--Raphson (NR) method. However, no indication of sufficient
+		conditions for convergence of the NR method are given. In particular,
+		the rate of convergence should be quantified. The pseudocode given on
+		page 16 only says to run it ``until convergence''. However, in
+		practice, the method is only run for a finite number of steps, inducing
+		some residual errors.  A formal analysis of this residual error and
+		guarantees under which it can be made sufficiently small so as not to
+		accumulate too much across steps and jeopardized consistency is
+		important. As currently written, the estimator used in the experiments
+		is not the same as the one which is analyzed theoretically and is not
+		proved to be consistent.
+	\item related to the previous point, I didn't find in the experiments any
+		indication of how the NR method was run, for how many steps, and
+		guidelines on how to chose the number of steps.
+\end{itemize}
+
+A less crucial point, but would still be ``nice to have'' is the following.
+Methods such as stochastic gradient descent are deemed insufficient in the
+introduction (page 2, line 26-29) for the problem considered in this paper.
+However, the problem at hand, with its QIF formulation, seems to fall well within
+the realm of stochastic approximation (where one wishes to minimize a function
+written as an expectation over a distribution, given only samples from this
+distribution). In fact, given that the proposed solution is based on
+a linearization of the QIF, I suspect that what it is doing is very close to
+what stochastic gradient descent applied to this problem would yield. It would
+be nice to give this stochastic gradient descent interpretation of the proposed
+solution, if it is true, or explain how it differs, or give more details about
+why stochastic gradient descent cannot be applied to this problem.
+
+\paragraph{Minor comments.} 
+\begin{itemize}
+	\item page 8 line 10, in the sum, the second $g_i$ should simply be $g$
+	\item page 37 line 23, the leading $+\frac{1}{N_b}$ should be
+		$-\frac{1}{N_b}$
+\end{itemize}
+
+\paragraph{Conclusion.} The problem considered in this paper is important and
+the proposed method interesting and well-justified by the experiments. However,
+a significant tightening of the theoretical analysis (along the lines the above
+points) is necessary.
+
+\end{document}