\documentclass[11pt]{article} \usepackage[T1]{fontenc} \usepackage[utf8]{inputenc} \usepackage[hmargin=1.2in, vmargin=1.4in]{geometry} \title{Review of \emph{Contextual Stochastic Block Model:\\Sharp Thresholds and Contiguity}} \date{} \begin{document} \maketitle \paragraph{Summary.} This paper considers a generalization of the stochastic block model in which additional covariates correlated with the community assignments are observed at each vertex. Specifically, the generative model for a graph with $n$ vertices is as follows: \begin{itemize} \item each vertex $i$ is assigned independently and uniformly at random to one of two community, $\sigma_i\in\{-1, +1\}$. \item two vertices are connected by an edge with probability $a/n$ if they belong to the same community, and with probability $b/n$ otherwise. Define $\lambda^2=1/2\cdot(a-b)^2/(a+b)$. \item a $p$-dimensional vector of covariates $B_i$ is observed at each vertex, where $B_i = \sqrt{\mu/n}\sigma_i u + Z_i$ with $u\sim \mathcal{N}(0,I_p)$ and $Z_i\sim\mathcal{N}(0,I_p)$ and $(Z_1,\dots,Z_n)$ mutually independent. \end{itemize} The general question studied in this work is whether recovery of the latent community assignments is possible from the joint observation of the adjacency matrix of the graph and the covariates $(B_1,\dots,B_n)$, in the asymptotic regime where $n/p\to\gamma >0$. Note that this problem reduces to the standard community detection problem when only the adjacency matrix is observed (or $\mu=0)$, and reduces to a clustering of a Gaussian mixtures when only the covariates $B_i$ are observed (or $\lambda=0$). The main result of this work is an exact characterization of the weak recovery threshold: when $\lambda^2 + \mu^2/\gamma <1$ then weak recovery is impossible, when $\lambda^2+\mu^2/\gamma>1$, the authors describe an estimator, computable in quasi-polynomial time, that achieves weak recovery. \paragraph{Significance of the results.} The question studied in this work is very natural, since a practitioner would typically observe some covariates about the graph's vertices in most community detection applications. Furthermore, the paper completely solves the question it sought to answer by describing a sharp threshold (at least as far as weak recovery is concerned). It is conceptually satisfying that the threshold established in this work generalizes both the recovery threshold for community detection in the stochastic block model, and the “spike detection” threshold in random matrix theory. It is also interesting to see that recovery is sometimes possible by combining information from \emph{both} types of observations (the covariates and the graph) even in cases where each type of information \emph{separately} would be insufficient for recovery. \vspace{1em}\noindent\emph{Model.} However, the specific generative model considered in this work seems a little restrictive. While the stochastic block model and the spiked random matrix model are both well-established and well-studied \emph{separately}, it is not clear that simply “gluing” them together constitutes a natural instantiation of the intuitive concept of \emph{stochastic block model with auxiliary information}. The model studied in this work has only been considered in two previous papers (Yan and Sarkar, 2020; Deshpande et al., 2018) and does not seem to be justified from the modeling perspective, but rather by the fact that it is amenable to analysis by combining existing techniques. This could be improved by providing at the very least a comparison with other variants of stochastic block model with auxiliary information, such as for example the one discussed in \emph{Structure and inference in annotated networks} (Newman and Clauset, 2016). On this note, a more natural modeling approach is to first draw latent covariates for the graph vertices, and \emph{then} to describe the conditional distribution of community assignments given this latent covariates. In contrast, the present work only describes the conditional distribution of the covariates given the community assignments, which does not feel natural in the context of network modeling. Finally, I think the paper would also benefit from a discussion of whether the analysis could generalize to more general models for the covariates $B_i$ (closer to the general spiked matrix models that are now commonly studied). \vspace{1em}\noindent\emph{Technical contribution.} All the theoretical results in this work closely follow the proof structures and techniques from recent work on recovery thresholds for the stochastic block model and spiked random matrix models. The main technical contribution is the introduction of the “cycle statistics” $Y_{n,k,l}$ similar to the ones studied for example by Mossel et al. (2015) in the context of community detection, and by Banerjee and Ma (2018) for spiked random matrices. The recovery algorithm closely follows the approach from Hopkins and Steurer (2017) by constructing a low-degree polynomial approximation of the second moment $\sigma\sigma^T$ of the community assignments. \paragraph{Other comments.} The paper is well-written overall, but I think the proofs could be made clearer if written in a more modular manner. As a simple example, the first part of proof of Theorem 1 in Section 2 is a very general argument that could be factored out (and it is in fact exactly identical to Lemma 2.4 in (Perry et al. 2018) that could be invoked directly). Another example is the proof of Proposition 5 that is particularly long: however the asymptotic distribution under $H_0$ is simply a specific case of the one under $H_1$ when $\lambda=\mu=0$, so it is not clear that it is necessary to conduct both derivations separately. I understand that some parts of the derivation under $H_0$ are reused under $H_1$, but these could be factored out. \paragraph{Conclusion.} This paper considers a natural and important question, and despite the limitations of the model and the relative absence of technical novelty, the theoretical results constitute an important step towards understanding the impact of auxiliary vertex information on the community detection threshold. I recommend acceptance, possibly after consolidating the discussion of the model and modularizing the proofs as suggested above. %\bibliographystyle{alpha} %\bibliography{jmlr-20-208} \end{document}