diff options
Diffstat (limited to 'jasa-2019-0653.tex')
| -rw-r--r-- | jasa-2019-0653.tex | 140 |
1 files changed, 140 insertions, 0 deletions
diff --git a/jasa-2019-0653.tex b/jasa-2019-0653.tex new file mode 100644 index 0000000..0536f51 --- /dev/null +++ b/jasa-2019-0653.tex @@ -0,0 +1,140 @@ +\documentclass[10pt]{article} +\usepackage[T1]{fontenc} +\usepackage[utf8]{inputenc} +\usepackage[hmargin=1.2in, vmargin=1.2in]{geometry} +\usepackage{amsmath,amsfonts} + +\title{\large Review of \emph{Real-time Regression Analysis of Streaming Clustered Data with Possible Abnormal Data Batches}} +\date{} + +\begin{document} + +\maketitle + +\paragraph{Summary.} This paper studies the problem of estimating the +parameters $\beta\in\mathbb{R}^p$ of a marginal generalized linear model with +correlated outcomes via the ``quadratic inference function'' (QIF). Minimizing +the QIF yields an estimator $\hat\beta^\star$ of $\beta$, the ``oracle'' +estimator. Since solving the minimizing problem requires processing all the +available samples at once and could be too expensive, the authors instead focus +on a streaming setting, where small clusters of data arrive in batch. The goal +is to obtain a ``renewable'' estimator such that given the current estimate one +can process a cluster from the stream and update the estimate on the fly, the +crucial point being that the update does not require looking at the entire +dataset and can be performed by only considering the current state and data +coming from the stream. The contributions are: +\begin{itemize} + \item by considering a linearization of the QIF, the authors obtain + a renewable estimator $\hat\beta$ for which the update can be performed + by solving an equation involving only the fresh data from the stream + and a summary of the past data (in the form of an aggregated gradient + and variance matrix). This equation can be solved using the + Newton--Raphson method. + \item a theoretical analysis of the resulting estimator, showing + consistency and asymptotic normality. Furthermore it is shown that in + the large data limit, $\hat\beta$ is equivalent to the oracle estimator + $\hat\beta^\star$. + \item in the presence of abnormal or corrupted data batches generated with + model parameters different from $\beta$, the authors propose + a goodness-of-fit test by comparing the current batch to the last + correct batch (the last one which passed the test) and only use data + batches passing the test in the renewable estimator. + \item a discussion about implementation of the proposed method on the Spark + Lambda architecture. + \item an evaluation of the proposed estimator on synthetic data (linear and + logistic models) and the monitoring procedure (test of abnormal + batches) confirming the theory. + \item an application to a real dataset of car crashes. +\end{itemize} + +\paragraph{Major Comments.} + +The problem considered in this paper is important and well-motivated and in +light of the ever-increasing amount of available data, the need for efficient +estimators is self-evident. The proposed method appears sound and its benefits +are convincingly demonstrated in the experiments. However, I found that the +theoretical analysis suffers from some notable gaps as follows: +\begin{itemize} + \item Appendix A.2 does not contain enough details for me to be able to + assess the correctness of the proof. While I believe the result to be + true (with maybe slightly stronger assumptions) the provided argument + does not constitute a proof in my opinion. In particular: + \begin{itemize} + \item the overall proof structure (by induction) does not seem + sound to me: the induction hypothesis is that $\tilde\beta_j$ + is consistent for $j=1,\dots,b-1$. However, being consistent is + an asymptotic property as the amount data goes to infinity. + Yet, for a fix $j$, $\tilde\beta_j$ is computed from a fixed + and finite amount of data and thus cannot be consistent. Note + that $\tilde\beta_j$ could be consistent if $n_j$ goes to + infinity, but it is only assumed that $N_b=\sum_{j=1}^b + n_j\to\infty$. So in its current form, I don't see how the + proposed proof structure addresses the case where $n_j=n$ is + constant and only $b$ goes to infinity (which still implies + $N_b = nb\to_{b\to\infty}\infty$). + \item it is not clear to me why the quantity in equation (a.4) is + $O_p(N_b^{-1/2})$ (this relates to the induction hypothesis + being ill-defined as per the previous point). + \item the argument from lines 35--40 on page 37 seems to use the + mean value theorem (although the quantity $\xi_b$ appearing in + those lines has not been defined and should be defined). + However, the mean value theorem does not hold for vector valued + functions. Since ultimately, only an inequality is needed, + I suspect an easy workaround is to use the mean value inequality, + but as written, the current argument is not valid. + \item the proof relies on $\tilde\beta_j$ being in + $\mathbb{N}_\delta$ for all $j$ (page 38, line 19). I don't + think this is guaranteed by the procedure, is it an + additional assumption that the authors rely on? If so, it + should be added to the theorem statement (or maybe the + assumption on positive-definiteness should be relaxed to hold + everywhere). + \item page 38, line 21, positive-definiteness of + $C_j(\tilde\beta_j)$ is invoked. However, the assumption in the + theorem is only that it is positive-definite \emph{in + expectation}. I am not sure how to bridge this gap (should the + assumption be strengthened in the theorem statement?) + \end{itemize} + \item the renewable method requires solving an equation using the + Newton--Raphson (NR) method. However, no indication of sufficient + conditions for convergence of the NR method are given. In particular, + the rate of convergence should be quantified. The pseudocode given on + page 16 only says to run it ``until convergence''. However, in + practice, the method is only run for a finite number of steps, inducing + some residual errors. A formal analysis of this residual error and + guarantees under which it can be made sufficiently small so as not to + accumulate too much across steps and jeopardized consistency is + important. As currently written, the estimator used in the experiments + is not the same as the one which is analyzed theoretically and is not + proved to be consistent. + \item related to the previous point, I didn't find in the experiments any + indication of how the NR method was run, for how many steps, and + guidelines on how to chose the number of steps. +\end{itemize} + +A less crucial point, but would still be ``nice to have'' is the following. +Methods such as stochastic gradient descent are deemed insufficient in the +introduction (page 2, line 26-29) for the problem considered in this paper. +However, the problem at hand, with its QIF formulation, seems to fall well within +the realm of stochastic approximation (where one wishes to minimize a function +written as an expectation over a distribution, given only samples from this +distribution). In fact, given that the proposed solution is based on +a linearization of the QIF, I suspect that what it is doing is very close to +what stochastic gradient descent applied to this problem would yield. It would +be nice to give this stochastic gradient descent interpretation of the proposed +solution, if it is true, or explain how it differs, or give more details about +why stochastic gradient descent cannot be applied to this problem. + +\paragraph{Minor comments.} +\begin{itemize} + \item page 8 line 10, in the sum, the second $g_i$ should simply be $g$ + \item page 37 line 23, the leading $+\frac{1}{N_b}$ should be + $-\frac{1}{N_b}$ +\end{itemize} + +\paragraph{Conclusion.} The problem considered in this paper is important and +the proposed method interesting and well-justified by the experiments. However, +a significant tightening of the theoretical analysis (along the lines the above +points) is necessary. + +\end{document} |
