summaryrefslogtreecommitdiffstats
path: root/cs284/main.tex
diff options
context:
space:
mode:
authorThibaut Horel <thibaut.horel@gmail.com>2015-09-14 23:08:02 -0400
committerThibaut Horel <thibaut.horel@gmail.com>2015-09-14 23:08:02 -0400
commitab0b1f3cefedb35327a19ec1b6afd560bfdf802d (patch)
treeb777f3e2c0ac0e712d8c5faab5107b1d236e2c3a /cs284/main.tex
parent960676226862d2d68c7a9c04c56d4f8157803025 (diff)
downloadcriminal_cascades-ab0b1f3cefedb35327a19ec1b6afd560bfdf802d.tar.gz
Import supplements and repo reorganization
Diffstat (limited to 'cs284/main.tex')
-rw-r--r--cs284/main.tex309
1 files changed, 0 insertions, 309 deletions
diff --git a/cs284/main.tex b/cs284/main.tex
deleted file mode 100644
index 4bfbd25..0000000
--- a/cs284/main.tex
+++ /dev/null
@@ -1,309 +0,0 @@
-\documentclass[10pt]{article}
-\usepackage[utf8]{inputenc}
-\usepackage{graphicx}
-\usepackage{subcaption}
-\usepackage{booktabs}
-\usepackage{amsmath,amsfonts,amsthm}
-\usepackage{algorithm, algpseudocode}
-\DeclareMathOperator{\E}{\mathbb{E}}
-\let\P\relax
-\DeclareMathOperator{\P}{\mathbb{P}}
-\newcommand{\ex}[1]{\E\left[#1\right]}
-\newcommand{\prob}[1]{\P\left[#1\right]}
-\newcommand{\eqdef}{\mathbin{\stackrel{\rm def}{=}}}
-\DeclareMathOperator*{\argmax}{argmax}
-\algrenewcommand\algorithmicrequire{\textbf{Input:}}
-\algrenewcommand\algorithmicensure{\textbf{Output:}}
-
-\title{Identifying Cascades of Violence}
-\author{Ben Green \and Thibaut Horel}
-\date{December 8, 2014}
-
-\begin{document}
-
-\maketitle
-
-\section{Introduction}
-Most discussions of violence evaluate crime in broad terms: this neighborhood is dangerous, that population is at risk. Yet such dialogue fails to explain why some people living in the same areas and with similar traits are shot while others are not.
-
-Recent studies of of criminal behavior have shown that violence is highly concentrated in small subsets of the population – not a neighborhood or social group, but a social network. In Chicago, a network with only 6\% of the population contains 70\% of the city’s non-fatal gunshot victims \cite{papachristos2014tragic}. The same study also found that one’s risk of being shot depends significantly on the number of gunshot victims in his or her neighborhood in the network. These and similar results imply that network analyses of crime and violence are critical in gaining a more nuanced understanding how violence clusters and spreads.
-
-Yet these snapshots of violence in social networks do not explain how or why observed patterns occur. Our goal in this study is to to consider dynamic processes among criminals to answer the followng question:
-
-\emph{What does data on criminal behavior reveal about how violence spreads in a social network?}
-
-We know from previous work that violence is clustered in criminal networks \cite{papachristos2014tragic, papachristos2014exposure}, implying that social processes may play a role in spreading violence. If this is the case, we expect to observe incidents of violence spreading through the network as an epidemic, similar to infections of a virus spreading through a population.
-
-Identifying patterns of violent behavior will reveal new insights about how to prevent it. Rather than targeting a broad demographic of people or even an entire social network, law enforcement and social services could pinpoint a small group of individuals who are at risk of violence or who they expect to spread it widely.
-
-Note however that a cascade of violence does not refer to direct propagation through the network, such as individual $u$ shoots individual $v$, who then later shoots individual $w$. Instead, we view the spread of violence as a more indirect process mediated by peer influence and exposure to risky situations.
-
-The main contributions of this work are as follows:
-\begin{itemize}
-\item by comparing simulated data to ground truth data, we both corroborate prior work on the importance of peer influence in the spread of violence as well as give evidence for its cascade-like nature.
-\item we discuss and propose a model for cascades of violence in criminal networks. In contrast to previous models of influence cascades, our model captures factors specific to the spread of violence and is able to incorporate our domain-specific data.
-\item we design an algorithm to recover the most likely cascades under our cascade model.
-\item finally, we conducted preliminary analyses on the cascades identified by this algorithm in our data.
-\end{itemize}
-
-\section{Previous Research on Influence Cascades}
-
-The Internet, and online social networks in particular, generate vast quantities of data that reflect the social processes of information diffusion. These include users sharing content on Facebook or Twitter and news sources reporting the same information. Data about these events are valuable for identifying trends, but can also be used to study the structure of social interactions behind information propagation.
-
-Since the seminal work of Kempe, Kleinberg, and Tardos \cite{kkt03}, there has been a large body of work proposing and exploiting models for influence propagation.
-
-In \cite{gomez2012netinf}, the authors developed a maximum likelihood model to infer the network underlying a set of cascades. They know the times that nodes were infected, and use this to infer which other nodes could have caused this infection. From this they can accurately generate the whole network. Similar models and algorithms are used in \cite{netrapalli2012learning, rodriguez2011uncovering}.
-
-Our problem is similar to these network inference problems but flips the question slightly. Instead of using infection times to infer the underlying network, we use infection times and the network to infer cascades. Another extension our work makes is to use additional data about the nodes beyond just their infection time.
-
-Another area of research is analyzing cascades to understand their structure and predict which are likely to be large. This is a very important task, since in certain situations (such as epidemics) a rare but very large cascade could have disastrous consequences. The authors of \cite{cheng2014cascades} developed many metrics for evaluating cascades, and used these to classify cascades in progress based on how large they are expected to grow.
-
-Applying these techniques to our data will enable us to analyze the structure of violence cascades to parse the social processes at play. Further work would allow us to identify which cascades are likely to become large, and should therefore be targeted by law enforcement and social services.
-
-\section{Data}
-Our data consists of police records from the City of Chicago, IL between 2006 and 2014 and contains:
-\begin{itemize}
-\item Arrest records
-\item Demographic data
-\item Victims and dates of non-fatal gunshots
-\end{itemize}
-
-Every person arrested during the study period represents a node. We generate a co-offending network by assigning edges between all pairs of individuals who have been arrested together. We define infection as being the victim of a gunshot. This yields a network similar (albeit much larger) to the one shown in Figure \ref{fig:lcc-newark}.
-
-\begin{figure}
-\centering
-\includegraphics[width=.8\textwidth]{lcc-newark}
-\caption{The largest connected component with labeled victims of the co-offending network in Newark, NJ.}
-\label{fig:lcc-newark}
-\end{figure}
-
-We fist decomposed the network into disjoint connected components which yielded 233,854 components, only one of which has more than 100 nodes. A summary of statistics for the co-offending network and its largest connected component can be found in Table~\ref{tab:data}. This distribution of infections provides a first hint for a contagion effect of violence: the largest connected component contains 31\% of the nodes but 74\% of the gunshot victims.
-
-\begin{table}[t]
- \centering
- \setlength{\tabcolsep}{5pt}
- \begin{tabular}{rlll}
- \toprule
- & Nodes & Edges & Victims \\
- \midrule
- Entire network & 396,042 & 415,454 & 11,206 \\
- Largest component & 123,506 & 366,514 & 8,237 \\
- \bottomrule
-\end{tabular}
-\caption{Summary of statistics for the co-offending network and its largest connected component.}
-\label{tab:data}
-\end{table}
-
-The largest connected component also is a small-world network. We plotted its degree distribution and observed that it approximately follows a power-law distribution (Figure \ref{fig:lcc-dd}). Furthermore, the largest connected component has a clustering coefficient of 0.43 and an average path length of 8.47. In comparison, an Erd\H{o}s-Réyni graph of the same size has a clustering coefficient of 0.000043 and an average path length of 6.78. Since the largest connected component has much larger clustering and only a slightly larger average path length, it is a small-world network.
-
-\begin{figure}
-\centering
-\includegraphics[width=.7\textwidth]{lcc-dd}
-\caption{The degree distribution of the LCC appears to follow a power-law distribution.}
-\label{fig:lcc-dd}
-\end{figure}
-
-The largest connected component is the only one large enough for analysis, so we will henceforth refer to it as our network and ignore the other components.
-
-As with any network study, co-offending networks do not provide a perfect picture of all social interactions. There are clearly many social connections that do not lead to shared arrests. Yet research has shown that most individuals who co-offend also associate with one another in other contexts \cite{papachristos2014tragic}. We posit that co-offending networks networks are therefore a conservative, yet reliable, estimate of association between individuals.
-
-\section{Simulating Cascades}
-As a first test for evidence of violence cascades, we considered how likely it would be to observe our data if all infections occurred independently without any influence process. By simulating infections in the network under the assumption of independent infection times, we can compare our data to a null model.
-
-We developed three methods to simulate independent infections:
-\begin{enumerate}
-\item Assume that every uninfected node has a small probability to become infected each day: $n_{victims}/(n_{nodes} \cdot n_{days})$. We calculated this value such that at the end of the study period we will have, in expectation, the same number of infected nodes as the data with uniform infection rates across the study period.
-\item Take the same set of infected nodes as the data, and assign each an infection date taken uniformly at random from the entire study period.
-\item Take the same set of infected nodes and infection dates as the data, and shuffle the matching between nodes and dates by randomly assigning a date to each infected node.
-\end{enumerate}
-
-The first simulation method makes no assumptions about who is likely to be infected or when. The second accounts for non-uniform likelihoods of certain nodes to become infected. This is suggested by the data since, for example, victim nodes in the data have mean degree above that of the entire network.
-
-The third simulation method builds on the second by maintaining the same infection dates as well as the same victim nodes. This accounts for the nonuniform probability of being infected at different times during the study period. Previous studies have shown that violence peaks during the summer and declines during the winter, a seasonality that our data corroborates (Figure \ref{fig:inf-dates}). This third simulation method, then, allows us to ask the following question: given that certain people are more likely to be infected and that more infections occur at certain times of the year, do we still observe evidence for violence spreading through the network as a contagion?
-
-\begin{figure}
-\centering
-\includegraphics[width=.75\textwidth]{infection-dates}
-\caption{Violence peaks in the summer and declines in the winter. Day 1 is the date of the first observed infection: February 8, 2006.}
-\label{fig:inf-dates}
-\end{figure}
-
-We evaluated three metrics of how clustered infection times are in the data and simulations:
-\begin{enumerate}
-\item The mean time between infections for all pairs of infected neighbors (nodes in the network that are both infected and have an edge between them). (547.6 days in the data)
-\item The median time between infections for all pairs of infected neighbors. (378 days)
-\item What fraction of infected neighbors were infected within 100 days of each other. (0.18)
-\end{enumerate}
-
-Because we have the same set of infected nodes in the data and all simulations, the number of infected neighbor pairs is almost identical in all cases. This ensures that we are considering samples of the same size when measuring these distributions.
-
-We generated 200 study periods using simulation methods 2 and 3. As expected, infected neighbors become infected closer together in time in simulation 3 than in simulation 2. Yet in both cases we did not see results that could have generated the data (Figure \ref{fig:sims}). The mean and median time between infections was smaller than occurred in any simulation, and there were many more pairs infected within 100 days of each other. This implies that the observed pattern of infections between neighbors could not have occurred without some social dependence or interaction.
-
-\begin{figure}
-\centering
-\begin{subfigure}[b]{0.495\textwidth}
-\includegraphics[width=\textwidth]{sim23-mean-time}
-\end{subfigure}
-\begin{subfigure}[b]{0.495\textwidth}
-\includegraphics[width=\textwidth]{sim23-mean-100}
-\end{subfigure}
-\caption{Time between neighbor infections in simulation 2 (green) and simulation 3 (blue) versus the data (red).}
-\label{fig:sims}
-\end{figure}
-
-Finally, we ran our algorithm for cascade recovery (see Section~\ref{sec:recover}) to recover cascades from one instance of the second simulation. This only identified five cascades, the largest of which contained only 4 nodes. This is to be contrasted to the cascades we recovered in our dataset and is yet another evidence of the cascade-like nature of the spread of violence observed in our dataset.
-
-\section{Model}
-
-We first define a probabilistic model for the propagation of violence along a single edge, and then describe the probabilistic model for the cascades of violence.
-
-\subsection{Pairwise Propagation and Time Model}
-
-For a pair $(u, v)$ of victims nodes such that $t_u < t_v$, we model the probability of propagation along the \emph{directed} edge $e = (u,v)$ as $p_e = p_s\cdot p_h\cdot p_t$ where:
-\begin{itemize}
-\item $p_t$ is the \emph{temporal component}: network cascades are inherently time-dependent, and it is typical to assume that the probability of infection from $u$ to $v$ is a decreasing function of the difference in infection times $\Delta t=t_v-t_u$. Here, we chose a power law model:
-\begin{displaymath}
-p_t(\Delta t) = \frac{\gamma-1}{\alpha}\cdot\frac{1}{\left(1+\frac{\Delta t}{\alpha}\right)^{\gamma}}
-\end{displaymath}
-where the parameter $\alpha$ controls for the decay rate of the temporal component as the difference in infection times increases.
-We deferred to future work the analysis of other time models such as an exponential or a Rayleigh model.
-The probability of infection along an edge $(u,v)$ where $t_u > t_v$ is $0$.
-For all tests described in this paper we used the value $\gamma = 1.01$.
-
-\item $p_h$ is the \emph{homophily component}. Because we have more data about each node than just its time of infection, we are able to estimate how likely one node is to infect another based on how similar the two nodes are. We assume that infections are more likely to occur between similar nodes, since these individuals are more likely to interact with one another or be exposed to the same risks. We will calculate the homophily between each pair of nodes using cosine similarity.
-\item $p_s$ is the \emph{structural component}. In two 2014 papers, Papachristos showed that the contagion effects of gun violence are not restricted to the neighbors of a gunshot victim \cite{papachristos2014tragic, papachristos2014exposure}. In cascade terms, even if a node's immediate neighbors are not infected it is still at risk of being infected via indirect exposure by infected nodes two or three hops away in the network. Using the values calculated in \cite{papachristos2014tragic, papachristos2014exposure}, we define $p_s$ as follows:
-\begin{equation*}
-p_s = \begin{cases}
- 0 & \text{if } d(u,v) \geq 4 \\
- 0.15\cdot 0.57^{1-d_{u,v}} & \text{otherwise} \\
- \end{cases}
-\end{equation*}
-where $d_{u,v}$ denotes the shortest path distance between $u$ and $v$ in the co-offending network. This structural component also accounts for the fact that we might have a partial view of the underlying social network: the co-offending network only reveals some of the edges of the ``true'' social network and the structural component defined above has the effect of adding edges to the co-offending network with a probability decaying exponentially as the distance in the original network increases.
-\end{itemize}
-
-The interpretation of the temporal component requires some precision. The infection from $u$ to $v$ can be seen as a two-step process:
-\begin{itemize}
-\item first choose whether or not an infection is going to occur according to a Bernouilli variable of parameter $p_s\cdot p_h$.
-\item then if an infection occurs, an \emph{incubation period} $\Delta t$ is drawn from the probabilistic model given by the temporal component. The infection of $v$ will occur at the end of the incubation period: $t_v = t_u + \Delta t$.
-\end{itemize}
-In particular, this allows us to define the probability that we do not observe an infection between a victim node $u$ and a non-victim node $v$ during our observation period. Let us denote by $t_{max}$ the time at which we stopped our observation, then:
-\begin{equation}
-\label{eq:pnot}
-\begin{split}
- \tilde{p}_{u, v} \eqdef\prob{u \text{ does not infect } v} &= 1 - p_s\cdot p_h + p_s\cdot p_h \prob{t_u + \Delta t > t_{max}}\\
- &= 1 - p_s\cdot p_h(1- \prob{t_u + \Delta t > t_{max}})
- \end{split}
-\end{equation}
-The term $1- \prob{t_u + \Delta t > t_{max}}$ can simply be interpreted as the probability that the infection time of $v$ falls within our observation period. For the probabilistic temporal model defined above, this can be computed explicitly:
-\begin{align*}
-1- \prob{t_u + \Delta t > t_{max}} &= \prob{\Delta t \leq t_{max} - t_u}\\
-&= \int_{0}^{t_{max}-t_u}\frac{\gamma-1}{\alpha}\cdot\frac{1}{\left(1+\frac{\Delta t}{\alpha}\right)^{\gamma}} = 1 - \frac{1}{\left(1+\frac{t_{max} - t_u}{\alpha}\right)^{\gamma}}
-\end{align*}
-
-\subsection{Cascade Model} \label{sec:cascade-model}
-
-The pairwise propagation model defined in the previous section naturally orients the edges of the co-offending network: an edge is oriented from $u$ to $v$ if the infection $t_u$ is smaller than $t_v$, and edges are always oriented from infected nodes to non infected nodes (since their infection time would then have to be larger than $t_{max}$). Furthermore, because of the structural component, we are considering more edges than there are in the original network: for any vertex $u$ within a radius three of $v$ we have an edge from $u$ to $v$ if $t_u< t_v$ or if $v$ is not infected.
-
-We first note that the resulting directed graph is acyclic. Indeed, assume that there was a cycle $(u_1,\ldots, u_k, u_1)$, then by definition of the temporal model we would have $t_1 < t_2 < \ldots t_k < t_1$ which is a contradiction. We will henceforth denote by $D$ this directed acyclic graph (DAG). We considered three models to describe cascades occuring withing $D$:
-\begin{itemize}
-\item \emph{DAG cascades}: in this model, we simply define a cascade as a subgraph of $D$ which forms a DAG. In particular, a cascade can have several seed nodes and a node can have multiple parents. This model is similar to the independent cascade model of \cite{kkt03}, the main differences being that new seed nodes can appear at any time (instead of solely at the initial time) and that cascades do not overlap.
-\item \emph{tree cascades}: in this model, a cascade is defined as a subgraph of $D$ which forms a tree. This implies that each cascade has a single seed node, the root, and that each infected node only has one parent. A similar model can be found in \cite{gomez2012netinf}, but as we discuss below, we had to model the probability of spawning a new cascade differently.
-\item \emph{hybrid model}: in this model a cascade has a single seed node, but we remove the constraint that an infected node has at most one parent. A cascade is thus a subgraph of $D$ which forms a DAG with a single source.
-\end{itemize}
-
-Even though we believe that the hybrid model better captures reality, we focused on the tree model in this work because of its simplicity. We defer the analysis of the hybrid model to future work.
-
-Let us now consider a set of cascades in the tree model, we can write this set as $\mathcal{T} = (T_1,\ldots, T_m)$ where $T_i$, $1\leq i\leq m$ are vertex disjoint trees. In order to define the probability of occurrence of this set of cascades, we need a probabilistic model for the spawn of a new cascade: each node can spontaneously initiate a new cascade of violence with probability $\beta$. This leads to the following probability of occurence for the cascades in $\mathcal{T}$:
-\begin{displaymath}
-\prob{\mathcal{T}\,|\, \alpha, \beta} = \prod_{i=1}^m \beta \prod_{e\in T_i} (1-\beta)p_e\prod_{e\in\delta(T_i)}(1-\beta)\tilde{p}_e
-\end{displaymath}
-where $\tilde{p}_e$ was defined in \eqref{eq:pnot} and where we define $\delta(T_i)$ to be the set of edges going from any node in $T_i$ to a non-infected node. Denoting by $E(\mathcal{T})$ the set of edges in $\mathcal{T}$ and by $\delta(\mathcal{T})$ the set of edges from any node in $\mathcal{T}$ to a non-infected node, this can be rewritten as:
-\begin{displaymath}
-\prob{\mathcal{T}\,|\, \alpha, \beta} = \beta^m(1-\beta)^{n-m}\prod_{e\in\mathcal{T}} p_e\prod_{e\in \delta(\mathcal{T})}\tilde{p}_e
-\end{displaymath}
-where $n$ is the total network of nodes $D$.
-
-\section{Recovering Cascades} \label{sec:recover}
-
-Given the probabilistic model described in the previous section, the problem of recovering cascades from the observed infection times can now be stated as a maximum likelihood estimation problem:
-\begin{equation}\label{eq:prob}
-\mathcal{T}^* = \argmax_{\mathcal{T}}\max_{\alpha, \beta}\prob{\mathcal{T}\,|\,\alpha, \beta}
-\end{equation}
-where the optimization is done over the model parameters ($\alpha$ and $\beta$) and over the set of feasible $\mathcal{T}$. It follows from our cascade model that $\mathcal{T}$ is feasible if and only if it comprises a set of directed edges between pairs of infected nodes such that each infected node has at most one predecessor.
-
-This suggests the following algorithm: for each infected node $v$, we have to decide whether or not we should consider it the root of a new cascade, or part of an already existing cascade. In the latter case, since $v$ can have at most one predecessor, we should select its most likely predecessor, that is, the predecessor $u$ for which $p_{u,v}$ is maximal among $u$'s predecessors. To decide between these two cases, we again consider which one is the most likely; for a predecessor $v$ of $u$ this amounts to the following comparison:
-\begin{displaymath}
-(1-\beta)p_{u,v} \geq \beta.
-\end{displaymath}
-By defining the threshold $\tau_{\beta} = \frac{\beta}{1-\beta}$, we see that it suffices to compare $p_{u,v}$ to the threshold value $\tau_{\beta}$. The full specification of the algorithm is given in Algorithm~\ref{alg:recover}.
-
-\begin{algorithm}
-\caption{Cascade Recovering Algorithm}
-\label{alg:recover}
- \begin{algorithmic}[1]
- \Require set $I$ of infected nodes, infection times $t_u$ for $u\in I$. Parameters $\alpha$, $\beta$
- \Ensure set $E$ of edges forming a forest of cascades
- \Statex
- \State $E\gets\emptyset$
- \State $\tau_{\beta}\gets \frac{\beta}{1-\beta}$
- \ForAll{$v\in I$}
- \State $u\gets\argmax_{w}p_{w,v}$
- \If{$p_{u,v}\geq \tau_{\beta}$}
- \State $E\gets E\cup\{(u.v)\}$ \Comment{$v$ was infected by $u$}
- \Else
- \State do nothing \Comment{$v$ spawned a new cascade}
- \EndIf
- \EndFor
- \State return $S$
- \end{algorithmic}
-\end{algorithm}
-
-It is interesting to note that as we decrease the value of $\beta$ relatively to the edge transmission probabilities, we will ``encourage'' more nodes to spawn new cascades rather than being infected by another infected node. Hence, we can see $\beta$ as a parameter controlling for the the size and number of cascades that we will recover.
-
-Based on the discussion above, for fixed values of $\alpha$ and $\beta$, Algorithm~\ref{alg:recover} solves the optimization problem \eqref{eq:prob} optimally. Thus, we are left with optimizing over $\alpha$ and $\beta$. Figure~\ref{fig:ll} shows the likelihood of the most likely set of cascades as a function of $\alpha$ and $\beta$. From this plot, we see that the function does not have any easy structure (like concavity) amenable to efficient optimization. Furthermore, it also presents many local minima which undermines the use of local search or gradient descent algorithms. As a consequence, we obtained the optimal values for $\alpha$ and $\beta$ through brute-force exploration of the parameter space. It seems that an analytical solution for $\alpha$ and $\beta$ could be obtained, but we defer this to future work.
-
-\begin{figure}
-\centering
-\includegraphics[width=1\textwidth]{ll}
-\caption{Log-likelihood of the most likely set of cascades as a function of $\alpha$ and $\beta$}
-\label{fig:ll}
-\end{figure}
-
-\section{Analyzing Cascades}
-We ran our algorithm to identify violence cascades for one set of parameters suggested by our analyses in Section \ref{sec:recover}: $\alpha = 40$ and $\beta = 0.055$.
-
-We return a set of 7,289 trees, 6,654 of which are isolated nodes. This means that we have 635 cascades in which a seed node infected at least one other node. The largest cascade contains 18 nodes. As shown in Figure \ref{fig:cascade-sizes}, the distribution of sizes returned by our algorithm follows a power law. This is in line with previous literature that has found cascade sizes to have this property \cite{cheng2014cascades}.
-
-\begin{figure}
-\centering
-\includegraphics[width=0.75\textwidth]{cascade-size-distribution}
-\caption{The distribution of cascade sizes follows a power-law.}
-\label{fig:cascade-sizes}
-\end{figure}
-
-The two largest cascades are plotted in Figure \ref{fig:cascades}. We conducted some preliminary analyses of the individuals in the cascades to gain a deeper understanding of how demographics affects the spread of violence. In both cascades, all victims are African-American males and the age ranges for the two cascades are 16--26 and 17--30, respectively (looking at the two cascades from left to right). In the first cascade, 14 out of 18 nodes are in a gang--4 in one gang and 10 in another. In the second cascade, 9 out of 10 nodes are in a gang--6 in one and 3 in another. Hence, violence spreads both within and across gangs, but this preliminary analysis gives some evidence for a more likely intra-gang influence. Interestingly, the same gang is responsible for the sets of 10 and 6 individuals belonging to the same gang in each cascade.
-
-\begin{figure}
-\centering
-\includegraphics[width=\textwidth]{cascades2}
-\caption{Two cascades returned by our algorithm. The seed nodes are marked in red. Edge thicknesses show the relative probability of transmission along the edge.}
-\label{fig:cascades}
-\end{figure}
-
-\section{Future Work}
-There are many areas to conduct further research in this project. We describe these below, in approximate order of short- to long-term goals:
-
-\begin{itemize}
-\item Further refine our model and try to gain better performance.
-\item Perform a more extensive search over the parameter space of $\alpha, \beta$. If possible, derive an analytical solution for the maximum likelihood solution over $\alpha$ and $\beta$. We should also consider how our parameters used in the structural component $p_s$ affect results.
-\item Evaluate how other temporal models influence our results. We can alter the scaling parameter $\gamma$ in the power-law model we have used so far, and also test Rayleigh and exponential time distributions.
-\item Use the demographic data about each person in the network to build a similarity score between individuals. This homophily component will allow us to more accurately estimate infection propagation along edges (assuming that a node is more likely to infect nodes with similar rather than different characteristics, such as age and race).
-\item Generate synthetic data and find sample data with which to test our model. One challenge of our present task is that there exists no ground truth for violence cascades (that is the point of our project!). But to trust the cascades we find in the data, we must first show that our algorithm returns accurate results for instances with a groundtruth.
-\item Develop an algorithm for the DAG cascade model and the hybrid model described in section \ref{sec:cascade-model}. If it is possible to solve the optimization problem in those cases, then we can compare how the inferred cascades compare with those from our current model.
-\item See whether or not we can use our cascade propagation model to predict the spread of violence. We could truncate our dataset to keep only nodes infected before a given date and try to predict the nodes that are going to be infected next. Another interesting analysis would be to predict which nodes will be the seeds of a cascade, and how large those cascades will be.
-\item Develop a model to identify which nodes and edges are most crucial for the propagation of violence. In other words, develop a model to remove nodes from the network in order to cause the greatest decrease of violence. This is has been identified in the literature as the \textsc{firefighter} problem.
-\end{itemize}
-
-\bibliographystyle{plain}
-\bibliography{main}
-\end{document}