\documentclass[10pt]{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage[hmargin=1.2in, vmargin=1.2in]{geometry}
\usepackage{amsmath,amsfonts}
\title{\large Review of \emph{Real-time Regression Analysis of Streaming Clustered Data with Possible Abnormal Data Batches}}
\date{}
\begin{document}
\maketitle
\paragraph{Summary.} This paper studies the problem of estimating the
parameters $\beta\in\mathbb{R}^p$ of a marginal generalized linear model with
correlated outcomes via the ``quadratic inference function'' (QIF). Minimizing
the QIF yields an estimator $\hat\beta^\star$ of $\beta$, the ``oracle''
estimator. Since solving this minimization problem requires processing all the
available samples at once, which can be prohibitively expensive, the authors
instead focus on a streaming setting, where small clusters of data arrive in
batches. The goal is to obtain a ``renewable'' estimator: given the current
estimate, one can process a cluster from the stream and update the estimate on
the fly, the crucial point being that the update does not require revisiting
the entire dataset and can be performed using only the current state and the
data coming from the stream. The contributions are:
\begin{itemize}
\item by considering a linearization of the QIF, the authors obtain
a renewable estimator $\hat\beta$ for which the update can be performed
by solving an equation involving only the fresh data from the stream
and a summary of the past data (in the form of an aggregated gradient
and variance matrix). This equation can be solved using the
Newton--Raphson method.
\item a theoretical analysis of the resulting estimator, showing
consistency and asymptotic normality. Furthermore, it is shown that in
the large data limit, $\hat\beta$ is equivalent to the oracle estimator
$\hat\beta^\star$.
\item in the presence of abnormal or corrupted data batches generated with
model parameters different from $\beta$, the authors propose
a goodness-of-fit test that compares the current batch to the last
correct batch (the last one that passed the test), and only use data
batches passing the test in the renewable estimator.
\item a discussion of the implementation of the proposed method on the
Spark Lambda architecture.
\item an evaluation on synthetic data (linear and logistic models) of the
proposed estimator and of the monitoring procedure (the test for
abnormal batches), confirming the theory.
\item an application to a real dataset of car crashes.
\end{itemize}
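For concreteness, the kind of update described above can be written
schematically (in my own notation, which is not necessarily the paper's: $U_b$
denotes the estimating function of batch $b$ and $J_j$ a batch-level
information matrix) as solving
\begin{equation*}
    \sum_{j=1}^{b-1} J_j(\tilde\beta_j)\,\bigl(\tilde\beta_{b-1}-\tilde\beta_b\bigr)
    + U_b(\tilde\beta_b) = 0
\end{equation*}
for $\tilde\beta_b$ via Newton--Raphson: only the aggregated matrix
$\sum_{j=1}^{b-1} J_j(\tilde\beta_j)$ and the fresh batch enter the update.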
\paragraph{Major Comments.}
The problem considered in this paper is important and well-motivated, and in
light of the ever-increasing amount of available data, the need for efficient
estimators is self-evident. The proposed method appears sound and its benefits
are convincingly demonstrated in the experiments. However, I found that the
theoretical analysis suffers from some notable gaps, as follows:
\begin{itemize}
\item Appendix A.2 does not contain enough details for me to be able to
assess the correctness of the proof. While I believe the result to be
true (with maybe slightly stronger assumptions) the provided argument
does not constitute a proof in my opinion. In particular:
\begin{itemize}
\item the overall proof structure (by induction) does not seem
sound to me: the induction hypothesis is that $\tilde\beta_j$
is consistent for $j=1,\dots,b-1$. However, being consistent is
an asymptotic property, holding as the amount of data goes to infinity.
Yet, for a fixed $j$, $\tilde\beta_j$ is computed from a fixed
and finite amount of data and thus cannot be consistent. Note
that $\tilde\beta_j$ could be consistent if $n_j$ goes to
infinity, but it is only assumed that $N_b=\sum_{j=1}^b
n_j\to\infty$. So in its current form, I don't see how the
proposed proof structure addresses the case where $n_j=n$ is
constant and only $b$ goes to infinity (which still gives
$N_b = nb \to \infty$ as $b\to\infty$).
\item it is not clear to me why the quantity in equation (a.4) is
$O_p(N_b^{-1/2})$ (this relates to the induction hypothesis
being ill-defined as per the previous point).
\item the argument from lines 35--40 on page 37 seems to use the
mean value theorem (although the quantity $\xi_b$ appearing in
those lines is never defined and should be). However, the mean
value theorem does not hold for vector-valued functions. Since
ultimately only an inequality is needed, I suspect an easy
workaround is to use the mean value inequality, but as written,
the current argument is not valid.
\item the proof relies on $\tilde\beta_j$ being in
$\mathbb{N}_\delta$ for all $j$ (page 38, line 19). I don't
think this is guaranteed by the procedure; is it an
additional assumption that the authors rely on? If so, it
should be added to the theorem statement (or maybe the
assumption on positive-definiteness should be strengthened to
hold everywhere).
\item page 38, line 21, positive-definiteness of
$C_j(\tilde\beta_j)$ is invoked. However, the assumption in the
theorem is only that it is positive-definite \emph{in
expectation}. I am not sure how to bridge this gap (should the
assumption be strengthened in the theorem statement?).
\end{itemize}
\item the renewable method requires solving an equation using the
Newton--Raphson (NR) method. However, no indication of sufficient
conditions for convergence of the NR method is given; in particular,
the rate of convergence should be quantified. The pseudocode given on
page 16 only says to run it ``until convergence''; in practice,
however, the method is run for only a finite number of steps, inducing
some residual error. A formal analysis of this residual error, with
guarantees under which it can be made small enough not to
accumulate across batches and jeopardize consistency, is
important. As currently written, the estimator used in the experiments
is not the same as the one that is analyzed theoretically, and it is not
proved to be consistent.
\item related to the previous point, the experiments give no
indication of how the NR method was run or for how many steps, nor
any guidelines on how to choose the number of steps.
\end{itemize}
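To illustrate the concern, here is a minimal toy sketch (this is \emph{not} the paper's algorithm: a one-dimensional logistic model with a renewable-style update, all notation and modeling choices mine) in which the inner Newton--Raphson loop is truncated to a fixed number of steps, so that the effect of the truncation can be checked against a variant run essentially to convergence.

```python
import numpy as np

rng = np.random.default_rng(0)

def score_and_info(beta, x, y):
    """Logistic score U(beta) and observed information J(beta) for one batch."""
    p = 1.0 / (1.0 + np.exp(-beta * x))
    return np.sum((y - p) * x), np.sum(p * (1.0 - p) * x * x)

def stream_estimate(batches, nr_steps):
    """Renewable-style streaming estimate with a truncated inner Newton loop."""
    beta, info_agg = 0.0, 0.0  # running estimate and aggregated information
    for x, y in batches:
        beta_prev = beta
        for _ in range(nr_steps):  # truncated Newton-Raphson iterations
            u, j = score_and_info(beta, x, y)
            g = info_agg * (beta_prev - beta) + u  # incremental estimating equation
            beta += g / (info_agg + j)             # one Newton step
        _, j = score_and_info(beta, x, y)
        info_agg += j  # fold the fresh batch into the summary statistic
    return beta

# Simulate 50 batches of 20 observations from a logistic model with beta = 1.
batches = []
for _ in range(50):
    x = rng.normal(size=20)
    y = (rng.random(20) < 1.0 / (1.0 + np.exp(-x))).astype(float)
    batches.append((x, y))

beta_1 = stream_estimate(batches, nr_steps=1)    # single inner step per batch
beta_20 = stream_estimate(batches, nr_steps=20)  # effectively run to convergence
print(beta_1, beta_20)
```

Whether a single inner step suffices will depend on the model, the batch sizes, and the conditioning of the problem; quantifying this is exactly the residual-error analysis requested above.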
A less crucial point, but one that would still be ``nice to have'', is the
following. Methods such as stochastic gradient descent are deemed insufficient
in the introduction (page 2, lines 26--29) for the problem considered in this
paper.
However, the problem at hand, with its QIF formulation, seems to fall well within
the realm of stochastic approximation (where one wishes to minimize a function
written as an expectation over a distribution, given only samples from this
distribution). In fact, given that the proposed solution is based on
a linearization of the QIF, I suspect that what it does is very close to
what stochastic gradient descent applied to this problem would yield. It would
be nice to give this stochastic gradient descent interpretation of the proposed
solution if it is accurate, or else to explain how the two differ, or to give
more details about why stochastic gradient descent cannot be applied here.
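To make this suggestion concrete: in a stochastic-approximation reading (again
in my own notation), each incoming batch $b$ provides a noisy gradient of the
population QIF objective, and plain stochastic gradient descent would iterate
\begin{equation*}
    \beta_b = \beta_{b-1} - \gamma_b \,\nabla Q_b(\beta_{b-1}),
\end{equation*}
where $Q_b$ is the QIF contribution of batch $b$ and $\gamma_b$ a step size.
If this reading is accurate, the proposed renewable update looks like
a second-order (Newton-type) variant of this scheme, with the aggregated
information matrix acting as a preconditioner; making this connection explicit
(or refuting it) would clarify the positioning of the contribution.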
\paragraph{Minor comments.}
\begin{itemize}
\item page 8 line 10, in the sum, the second $g_i$ should simply be $g$
\item page 37 line 23, the leading $+\frac{1}{N_b}$ should be
$-\frac{1}{N_b}$
\end{itemize}
\paragraph{Conclusion.} The problem considered in this paper is important and
the proposed method interesting and well-justified by the experiments. However,
a significant tightening of the theoretical analysis (along the lines of the
points above) is necessary.
\end{document}