\documentclass[10pt]{article} \usepackage[T1]{fontenc} \usepackage[utf8]{inputenc} \usepackage[hmargin=1.2in, vmargin=1.2in]{geometry} \usepackage{amsmath,amsfonts} \title{\large Review of \emph{Real-time Regression Analysis of Streaming Clustered Data with Possible Abnormal Data Batches}} \date{} \begin{document} \maketitle \paragraph{Summary.} This paper studies the problem of estimating the parameters $\beta\in\mathbb{R}^p$ of a marginal generalized linear model with correlated outcomes via the ``quadratic inference function'' (QIF). Minimizing the QIF yields an estimator $\hat\beta^\star$ of $\beta$, the ``oracle'' estimator. Since solving the minimization problem requires processing all available samples at once and can be too expensive, the authors instead focus on a streaming setting, where small clusters of data arrive in batches. The goal is to obtain a ``renewable'' estimator: given the current estimate, one can process a cluster from the stream and update the estimate on the fly, the crucial point being that the update does not require revisiting the entire dataset and can be performed using only the current state and the data coming from the stream. The contributions are: \begin{itemize} \item by considering a linearization of the QIF, the authors obtain a renewable estimator $\hat\beta$ whose update can be performed by solving an equation involving only the fresh data from the stream and a summary of the past data (in the form of an aggregated gradient and variance matrix). This equation can be solved using the Newton--Raphson method. \item a theoretical analysis of the resulting estimator, showing consistency and asymptotic normality. Furthermore, it is shown that in the large-data limit, $\hat\beta$ is equivalent to the oracle estimator $\hat\beta^\star$.
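For concreteness, my understanding of the renewable update is the following (written schematically; the notation here is mine and may differ from the paper's). With $\tilde\beta_{b-1}$ the current estimate and $\tilde J_{b-1}$ an aggregated sensitivity matrix summarizing batches $1,\dots,b-1$, the arrival of batch $D_b$ requires solving
\[
\tilde J_{b-1}\,\bigl(\tilde\beta_{b-1}-\tilde\beta_b\bigr) + U_b\bigl(\tilde\beta_b; D_b\bigr) = 0
\]
for $\tilde\beta_b$, where $U_b$ is the estimating function computed on the new batch only; the summary is then refreshed as $\tilde J_b=\tilde J_{b-1}+J_b(\tilde\beta_b)$. If this reading is correct, it may be worth stating the update in such a compact form early in the paper.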
\item in the presence of abnormal or corrupted data batches generated with model parameters different from $\beta$, the authors propose a goodness-of-fit test that compares the current batch to the last correct batch (the last one that passed the test); only data batches passing the test are used in the renewable estimator. \item a discussion of the implementation of the proposed method on the Spark Lambda architecture. \item an evaluation on synthetic data (linear and logistic models) of the proposed estimator and of the monitoring procedure (detection of abnormal batches), confirming the theory. \item an application to a real dataset of car crashes. \end{itemize} \paragraph{Major Comments.} The problem considered in this paper is important and well-motivated, and in light of the ever-increasing amount of available data, the need for efficient estimators is self-evident. The proposed method appears sound and its benefits are convincingly demonstrated in the experiments. However, I found that the theoretical analysis suffers from some notable gaps, detailed below: \begin{itemize} \item Appendix A.2 does not contain enough details for me to assess the correctness of the proof. While I believe the result to be true (possibly under slightly stronger assumptions), the provided argument does not constitute a proof in my opinion. In particular: \begin{itemize} \item the overall proof structure (by induction) does not seem sound to me: the induction hypothesis is that $\tilde\beta_j$ is consistent for $j=1,\dots,b-1$. However, consistency is an asymptotic property, defined as the amount of data goes to infinity. Yet, for a fixed $j$, $\tilde\beta_j$ is computed from a fixed and finite amount of data and thus cannot be consistent. Note that $\tilde\beta_j$ could be consistent if $n_j$ went to infinity, but it is only assumed that $N_b=\sum_{j=1}^b n_j\to\infty$.
So in its current form, I do not see how the proposed proof structure addresses the case where $n_j=n$ is constant and only $b$ goes to infinity (which still implies $N_b = nb\to\infty$ as $b\to\infty$). \item it is not clear to me why the quantity in equation (a.4) is $O_p(N_b^{-1/2})$ (this relates to the induction hypothesis being ill-defined, as per the previous point). \item the argument from lines 35--40 on page 37 seems to use the mean value theorem (although the quantity $\xi_b$ appearing in those lines has not been defined; it should be). However, the mean value theorem does not hold for vector-valued functions. Since ultimately only an inequality is needed, I suspect an easy workaround is to use the mean value inequality, $\|F(x)-F(y)\|\le \sup_{0\le t\le 1}\|DF(y+t(x-y))\|\,\|x-y\|$, but as written, the current argument is not valid. \item the proof relies on $\tilde\beta_j$ being in $\mathcal{N}_\delta$ for all $j$ (page 38, line 19). I do not think this is guaranteed by the procedure; is it an additional assumption that the authors rely on? If so, it should be added to the theorem statement (or perhaps the positive-definiteness assumption should be strengthened to hold everywhere). \item page 38, line 21, positive-definiteness of $C_j(\tilde\beta_j)$ is invoked. However, the assumption in the theorem is only that it is positive-definite \emph{in expectation}. I am not sure how to bridge this gap (should the assumption be strengthened in the theorem statement?). \end{itemize} \item the renewable method requires solving an equation using the Newton--Raphson (NR) method. However, no indication of sufficient conditions for the convergence of the NR method is given; in particular, the rate of convergence should be quantified. The pseudocode given on page 16 only says to run it ``until convergence''. In practice, however, the method is only run for a finite number of steps, inducing some residual error.
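For reference, the type of statement I have in mind is the standard local convergence guarantee for NR (stated informally here, under conditions the authors would need to verify for their equation): if the Jacobian of the estimating equation is nonsingular at the solution $\tilde\beta_b$ and Lipschitz in a neighborhood of it, then for any starting point close enough to $\tilde\beta_b$ the NR iterates satisfy
\[
\bigl\|\beta^{(k+1)}-\tilde\beta_b\bigr\| \le C\,\bigl\|\beta^{(k)}-\tilde\beta_b\bigr\|^2,
\]
i.e.\ local quadratic convergence, so that accuracy $\epsilon$ is reached after $O(\log\log(1/\epsilon))$ iterations. A control of the constant $C$ and of the radius of the basin of attraction, uniform in $b$, would then make it possible to bound the accumulated residual error.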
A formal analysis of this residual error, and conditions under which it can be made small enough not to accumulate across steps and jeopardize consistency, is important. As currently written, the estimator used in the experiments is not the same as the one analyzed theoretically, and it is not proved to be consistent. \item related to the previous point, I did not find in the experiments any indication of how the NR method was run, for how many steps, or guidelines on how to choose the number of steps. \end{itemize} A less crucial point, but one that would still be ``nice to have'', is the following. Methods such as stochastic gradient descent are deemed insufficient in the introduction (page 2, lines 26--29) for the problem considered in this paper. However, the problem at hand, with its QIF formulation, seems to fall well within the realm of stochastic approximation (where one wishes to minimize a function written as an expectation over a distribution, given only samples from this distribution). In fact, given that the proposed solution is based on a linearization of the QIF, I suspect that what it does is very close to what stochastic gradient descent applied to this problem would yield. It would be nice to give this stochastic gradient descent interpretation of the proposed solution, if it is correct, or to explain how the method differs, or to give more details about why stochastic gradient descent cannot be applied to this problem. \paragraph{Minor comments.} \begin{itemize} \item page 8, line 10, in the sum, the second $g_i$ should simply be $g$. \item page 37, line 23, the leading $+\frac{1}{N_b}$ should be $-\frac{1}{N_b}$. \end{itemize} \paragraph{Conclusion.} The problem considered in this paper is important, and the proposed method is interesting and well-justified by the experiments. However, a significant tightening of the theoretical analysis (along the lines of the points above) is necessary. \end{document}