\documentclass[10pt]{article} \usepackage[T1]{fontenc} \usepackage[utf8]{inputenc} \usepackage[hmargin=1.2in, vmargin=1.2in]{geometry} \usepackage{amsmath,amsfonts} \title{\large Review of \emph{Real-time Regression Analysis of Streaming Clustered Data with Possible Abnormal Data Batches}} \date{} \begin{document} \maketitle \paragraph{Summary.} This paper studies the problem of estimating the parameters $\beta\in\mathbb{R}^p$ of a marginal generalized linear model with correlated outcomes via the ``quadratic inference function'' (QIF). Minimizing the QIF yields an estimator $\hat\beta^\star$ of $\beta$, the ``oracle'' estimator. Since solving the minimization problem requires processing all available samples at once and can be too expensive, the authors instead focus on a streaming setting, where small clusters of data arrive in batches. The goal is to obtain a ``renewable'' estimator: given the current estimate, one can process a cluster from the stream and update the estimate on the fly, the crucial point being that the update does not require revisiting the entire dataset and can be performed using only the current state and the data coming from the stream. The contributions are: \begin{itemize} \item by considering a linearization of the QIF, the authors obtain a renewable estimator $\hat\beta$ whose update can be performed by solving an equation involving only the fresh data from the stream and a summary of the past data (in the form of an aggregated gradient and variance matrix). This equation can be solved using the Newton--Raphson method. \item a theoretical analysis of the resulting estimator, showing consistency and asymptotic normality. Furthermore, it is shown that in the large-data limit, $\hat\beta$ is equivalent to the oracle estimator $\hat\beta^\star$.
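For concreteness, my understanding of the renewable update is the following (written schematically; the notation here is mine and may differ from the paper's). With $\tilde\beta_{b-1}$ the current estimate and $\tilde J_{b-1}$ an aggregated sensitivity matrix summarizing batches $1,\dots,b-1$, the arrival of batch $D_b$ requires solving
\[
\tilde J_{b-1}\,\bigl(\tilde\beta_{b-1}-\tilde\beta_b\bigr) + U_b\bigl(\tilde\beta_b; D_b\bigr) = 0
\]
for $\tilde\beta_b$, where $U_b$ is the estimating function computed on the new batch only; the summary is then refreshed as $\tilde J_b=\tilde J_{b-1}+J_b(\tilde\beta_b)$. If this reading is correct, it may be worth stating the update in such a compact form early in the paper.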
\item in the presence of abnormal or corrupted data batches generated with model parameters different from $\beta$, the authors propose a goodness-of-fit test that compares the current batch to the last correct batch (the last one that passed the test); only data batches passing the test are used in the renewable estimator. \item a discussion of the implementation of the proposed method on the Spark Lambda architecture. \item an evaluation on synthetic data (linear and logistic models) of the proposed estimator and of the monitoring procedure (detection of abnormal batches), confirming the theory. \item an application to a real dataset of car crashes. \end{itemize} \paragraph{Major Comments.} The problem considered in this paper is important and well-motivated, and in light of the ever-increasing amount of available data, the need for efficient estimators is self-evident. The proposed method appears sound and its benefits are convincingly demonstrated in the experiments. However, I found that the theoretical analysis suffers from some notable gaps, detailed below: \begin{itemize} \item Appendix A.2 does not contain enough details for me to assess the correctness of the proof. While I believe the result to be true (possibly under slightly stronger assumptions), the provided argument does not constitute a proof in my opinion. In particular: \begin{itemize} \item the overall proof structure (by induction) does not seem sound to me: the induction hypothesis is that $\tilde\beta_j$ is consistent for $j=1,\dots,b-1$. However, consistency is an asymptotic property, defined as the amount of data goes to infinity. Yet, for a fixed $j$, $\tilde\beta_j$ is computed from a fixed and finite amount of data and thus cannot be consistent. Note that $\tilde\beta_j$ could be consistent if $n_j$ went to infinity, but it is only assumed that $N_b=\sum_{j=1}^b n_j\to\infty$.
So in its current form, I do not see how the proposed proof structure addresses the case where $n_j=n$ is constant and only $b$ goes to infinity (which still implies $N_b = nb\to\infty$ as $b\to\infty$). \item it is not clear to me why the quantity in equation (a.4) is $O_p(N_b^{-1/2})$ (this relates to the induction hypothesis being ill-defined, as per the previous point). \item the argument from lines 35--40 on page 37 seems to use the mean value theorem (although the quantity $\xi_b$ appearing in those lines has not been defined; it should be). However, the mean value theorem does not hold for vector-valued functions. Since ultimately only an inequality is needed, I suspect an easy workaround is to use the mean value inequality, $\|F(x)-F(y)\|\le \sup_{0\le t\le 1}\|DF(y+t(x-y))\|\,\|x-y\|$, but as written, the current argument is not valid. \item the proof relies on $\tilde\beta_j$ being in $\mathcal{N}_\delta$ for all $j$ (page 38, line 19). I do not think this is guaranteed by the procedure; is it an additional assumption that the authors rely on? If so, it should be added to the theorem statement (or perhaps the positive-definiteness assumption should be strengthened to hold everywhere). \item page 38, line 21, positive-definiteness of $C_j(\tilde\beta_j)$ is invoked. However, the assumption in the theorem is only that it is positive-definite \emph{in expectation}. I am not sure how to bridge this gap (should the assumption be strengthened in the theorem statement?). \end{itemize} \item the renewable method requires solving an equation using the Newton--Raphson (NR) method. However, no indication of sufficient conditions for the convergence of the NR method is given; in particular, the rate of convergence should be quantified. The pseudocode given on page 16 only says to run it ``until convergence''. In practice, however, the method is only run for a finite number of steps, inducing some residual error.
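For reference, the type of statement I have in mind is the standard local convergence guarantee for NR (stated informally here, under conditions the authors would need to verify for their equation): if the Jacobian of the estimating equation is nonsingular at the solution $\tilde\beta_b$ and Lipschitz in a neighborhood of it, then for any starting point close enough to $\tilde\beta_b$ the NR iterates satisfy
\[
\bigl\|\beta^{(k+1)}-\tilde\beta_b\bigr\| \le C\,\bigl\|\beta^{(k)}-\tilde\beta_b\bigr\|^2,
\]
i.e.\ local quadratic convergence, so that accuracy $\epsilon$ is reached after $O(\log\log(1/\epsilon))$ iterations. A control of the constant $C$ and of the radius of the basin of attraction, uniform in $b$, would then make it possible to bound the accumulated residual error.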
A formal analysis of this residual error, and conditions under which it can be made small enough not to accumulate across steps and jeopardize consistency, is important. As currently written, the estimator used in the experiments is not the same as the one analyzed theoretically, and it is not proved to be consistent. \item related to the previous point, I did not find in the experiments any indication of how the NR method was run, for how many steps, or guidelines on how to choose the number of steps. \end{itemize} A less crucial point, but one that would still be ``nice to have'', is the following. Methods such as stochastic gradient descent are deemed insufficient in the introduction (page 2, lines 26--29) for the problem considered in this paper. However, the problem at hand, with its QIF formulation, seems to fall well within the realm of stochastic approximation (where one wishes to minimize a function written as an expectation over a distribution, given only samples from this distribution). In fact, given that the proposed solution is based on a linearization of the QIF, I suspect that what it does is very close to what stochastic gradient descent applied to this problem would yield. It would be nice to give this stochastic gradient descent interpretation of the proposed solution, if it is correct, or to explain how the method differs, or to give more details about why stochastic gradient descent cannot be applied to this problem. \paragraph{Minor comments.} \begin{itemize} \item page 8, line 10, in the sum, the second $g_i$ should simply be $g$. \item page 37, line 23, the leading $+\frac{1}{N_b}$ should be $-\frac{1}{N_b}$. \end{itemize} \paragraph{Conclusion.} The problem considered in this paper is important, and the proposed method is interesting and well-justified by the experiments. However, a significant tightening of the theoretical analysis (along the lines of the points above) is necessary. \end{document}