| author | Thibaut Horel <thibaut.horel@gmail.com> | 2014-11-22 21:08:41 -0500 |
|---|---|---|
| committer | Thibaut Horel <thibaut.horel@gmail.com> | 2014-11-22 21:08:41 -0500 |
| commit | 36eb1fee5492e57368846cbf4e107f1e4cb31589 (patch) | |
| tree | 6380028284779e10d01fb9ff51f3c561ae9ce57c /paper/sections/experiments.tex | |
| parent | 4f7d4804234f5515a4dded8b05d9568653b7ae3c (diff) | |
| download | fast-seeding-36eb1fee5492e57368846cbf4e107f1e4cb31589.tar.gz | |
WWW version
Diffstat (limited to 'paper/sections/experiments.tex')
| -rw-r--r-- | paper/sections/experiments.tex | 475 |
1 file changed, 424 insertions, 51 deletions
diff --git a/paper/sections/experiments.tex b/paper/sections/experiments.tex index 1a085d1..201066a 100644 --- a/paper/sections/experiments.tex +++ b/paper/sections/experiments.tex @@ -1,14 +1,88 @@ -In this section, we give experimental evidence of the validity of our approach. Specifically, we show that our algorithms obtain significant improvement over standard influence maximization, that these improvements are observed in various scenarios (different verticals, different models of influence propagation), and that our approach is efficient in terms of running-time and scalable to large social networks. +In this section we validate the adaptive seeding approach through experimentation. +%In the sections above we showed how to design algorithms that have provable performance guarantees for adaptive seeding, which and even parallelizable. +Specifically, we show that our algorithms for adaptive seeding obtain
+significant improvement over standard influence maximization, that these
+improvements are robust to changes in environment variables, and that our
+approach is efficient in terms of running-time and scalable to large social
+networks. -\subsection{Data description} -Our main testbed is the friendship graph of the social network service Facebook~\cite{facebook}. We start by observing that an efficient campaign run on a social network will primarily target users who have already expressed interests in the topic being promoted. In the adaptive seeding terminology, these users constitute the set of initial nodes that can be targeted during the seeding phase. +\subsection{Experimental setup} +We tested our algorithms on three types of datasets. Each of them allows us to
+experiment with a different aspect of the adaptive seeding problem. The Facebook
+Pages dataset that we collected ourselves has a central place in our
+experiments since it is the closest to actual applications of
+adaptive seeding. -In order to reproduce this process, we selected ten different verticals (topics) and for each of these verticals, a person, an institution or an entity whose associated Facebook page is regularly used for promotional posts related to this topic. On each of these pages, we selected a recent post (posted no later than January 2014) with approximately 1,000 \emph{likes}. The set of users who liked those posts constitute our initial sets of nodes. +\textbf{Synthetic networks.} Using standard models of social networks, we
+generated large-scale graphs. To emulate the
+process of users following a topic (the core set $X$), we sampled subsets of
+nodes at random, and applied our algorithms to the sampled nodes and their neighbors.
+The main advantage of these data sets is that they allow us to generate graphs
+of arbitrary sizes and experiment with various parameters that govern the
+structure of the graph. The disadvantages are that users who follow a topic
+are not necessarily random samples, and that social networks often have
+structural properties that are not captured in generative models. -We then crawled the social network of these sets: for each user, we collected her list of friends, and the degrees (number of friends) of these friends. Table~\ref{tab:data} summarizes statistics about the collected data. +\textbf{Real networks.} We used publicly available data sets of real social
+networks from \cite{snapnets}.
As for synthetic networks, we used
+a random sample of nodes to emulate users who follow a topic, which is the main
+disadvantage of this approach. The advantage, however, is that such datasets
+contain an entire network, which allows testing different propagation
+parameters.
+
+\textbf{Facebook Pages.} We collected data from several Facebook Pages, each
+associated with a commercial entity that uses the Facebook page to communicate
+with its followers. For each page, we selected a post and then collected data
+about the users who expressed interest (``liked'') the post and their friends.
+The advantage of this data set is that it is highly representative of the
+scenario we study here. Campaigns run on a social network will primarily target
+users who have already expressed interest in the topic being promoted. The
+main disadvantage of this method is that such data is extremely difficult to
+collect due to the crawling restrictions that Facebook applies, and that it gives us
+only the 2-hop neighborhood around a post. This makes it difficult to
+experiment with different propagation parameters. Fortunately, as we soon
+discuss, we were able to circumvent some of the crawling restrictions and
+collect large networks, and the properties of the voter influence model are
+such that these datasets suffice to accurately account for influence
+propagation in the graph.\newline -\begin{table}[h!] + +\begin{figure}[t] + \centering + \includegraphics[width=0.4\textwidth]{images/para.pdf} + \caption{Comparison of the average degree of the core set users and the + average degree of their friends.} + \label{fig:paradox} + \vspace{-10pt} +\end{figure} + +\noindent\textbf{Data collection.}
+We selected Facebook Pages in different verticals (topics). Each page is
+operated by an institution or an entity whose associated Facebook Page is
+regularly used for promotional posts related to this topic. On each of these
+pages, we selected a recent post (posted no later than January 2014) with
+approximately 1,000 \emph{likes}. The set of users who liked those posts
+constitutes our core set. We then crawled the social network of
+the core set: for each user, we collected her list of friends, and the degrees
+(number of friends) of these friends.\newline
+
+
+\noindent\textbf{Data description.} Among the several verticals we collected,
+we selected eight for which we report our results. We obtained
+similar results for the others. Table~\ref{tab:data} summarizes statistics
+about the selected verticals. We note that depending on the privacy settings
+of the core set users, it was not always possible to access their list of
+friends. We decided to remove these users since their ability to spread
+information could not be readily determined. This effect, combined with various
+errors encountered during the data collection, accounts for an approximate 15\%
+reduction between the users who liked a post and the number of users in the
+datasets we used. Following our discussion in the introduction, we observe
+that on average, the degrees of core set users are much lower than the degrees of
+their friends. This is highlighted in Figure~\ref{fig:paradox} and justifies
+our approach.
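The friendship-paradox gap summarized in Figure~\ref{fig:paradox} is easy to reproduce on any graph. The sketch below is illustrative only (it is not the crawling or analysis code used for the paper): it assumes a `networkx` graph and treats a uniformly random sample as the core set, which matches the synthetic and SNAP setups rather than the Facebook Pages crawl.

```python
import random
import networkx as nx

def friendship_paradox_gap(G, core_size=1000, seed=0):
    """Average degree of a sampled core set vs. the average degree
    of their friends (the comparison plotted in Figure fig:paradox)."""
    rng = random.Random(seed)
    core = rng.sample(list(G.nodes()), core_size)
    core_avg = sum(G.degree(u) for u in core) / len(core)
    # Each (core user, friend) pair contributes one friend-degree sample.
    friend_degrees = [G.degree(v) for u in core for v in G.neighbors(u)]
    friend_avg = sum(friend_degrees) / len(friend_degrees)
    return core_avg, friend_avg

# Illustrative run on a synthetic preferential-attachment graph.
G = nx.barabasi_albert_graph(100_000, 10)
print(friendship_paradox_gap(G))
```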
+ +\begin{table}[t] \small \centering \setlength{\tabcolsep}{3pt} @@ -18,104 +92,403 @@ We then crawled the social network of these sets: for each user, we collected he \midrule Charity & Kiva & 978 & 131334 \\%& 134.29 & 1036.26\\ Travel & Lonely Planet & 753 & 113250 \\%& 150.40 & 898.50\\ - Public Action & LaManifPourTous & 1041 & 97959 \\%& 94.10 & 722.02\\ + %Public Action & LaManifPourTous & 1041 & 97959 \\%& 94.10 & 722.02\\ Fashion & GAP & 996 & 115524 \\%& 115.99 & 681.98\\ Events & Coachella & 826 & 102291 \\%& 123.84 & 870.16\\ Politics & Green Party & 1044 & 83490 \\%& 79.97 & 1053.25\\ Technology & Google Nexus & 895 & 137995 \\%& 154.19 & 827.84\\ News & The New York Times & 894 & 156222 \\%& 174.74 & 1033.94 \\ - Consumption & Peet's & 776 & 56268 \\%& 72.51 & 520.47\\ + %Consumption & Peet's & 776 & 56268 \\%& 72.51 & 520.47\\ Entertainment & HBO & 828 & 108699 \\%& 131.28 & 924.09\\ \bottomrule \end{tabular} -\caption{Dataset statistics. $m$: number of initial users, $n$: number of friends of the initial users.} +\caption{Dataset statistics. $m$: number of users in the core set, $n$: number +of friends of core set users.} %$S$: avg. degree of an initial user, $F$: avg. degree of a friend of an initial user.} \label{tab:data} + \vspace{-10pt} \end{table} -We note that depending on the privacy settings of the initial users, it was not always possible to access their list of friends. We decided to remove these users since their ability to spread information could not be readily determined. This effect, combined with various errors encountered during the data collection, accounts for an approximate 15\% reduction between the users who liked a post and the number of users in the datasets we used. - -\begin{figure}[t!] - \centering - \includegraphics[scale=0.7]{images/para.pdf} - \caption{Comparison of the average degree of the initial users and the average degree of their friends.} - \label{fig:para} -\end{figure} - -Following our discussion in the introduction, we observe that on average, the degrees of initial users is much lower than the degrees of their friends. This is highlighted on Figure~\ref{fig:para} and justifies our approach. - \subsection{Performance of Adaptive Seeding} -\label{sec:performance} +\label{sec:performance} For a given problem instance with a budget of $k$ we +applied the adaptive seeding algorithm (the combinatorial version). Recall from +Section~\ref{sec:model} that performance is defined as the expected influence +that the seeder can obtain by optimally selecting users on the second stage, +where \emph{influence} is defined as the sum of the degrees of the selected +users. We tested our algorithm against the following benchmarks: -We start by comparing the performance of adaptive seeding to standard influence maximization approaches. The following heuristics were considered: \begin{itemize} - \item \emph{Max degree:} selecting as many initial users as the budget permits, in decreasing order of degrees. - \item \emph{Random:} selecting a random sample of initial users of the size permitted by the budget. - \item \emph{Random friend:} spending half of the budget to select a random sample of initial users. For each of the selected random users, we then select randomly one of her friend (hence spending the total budget overall). + \item \emph{Random Node} (\textsf{RN}): we randomly select $k$ users from + the core set. This is a typical benchmark in comparing influence + maximization algorithms~\cite{KKT03}. 
+ \item \emph{Influence Maximization} (\textsf{IM}): we apply the optimal
+ influence maximization algorithm on the core set. This is the naive
+ application of influence maximization. For the voter model, when the
+ propagation time is polynomially large in the network size, the optimal
+ solution is to simply take the $k$ highest-degree nodes~\cite{even-dar}.
+ We study the case of bounded time horizons in Section~\ref{sec:inf}.
+ \item \emph{Random Friend} (\textsf{RF}): we implement a naive two-stage approach:
+ randomly select $k/2$ nodes from the core set, and for each
+ node select a random neighbor (hence spending the budget of $k$ rewards
+ overall). This method was recently shown to outperform standard
+ influence maximization when the core set is random~\cite{LS13}. \end{itemize} -\begin{figure*}[p] - \centerline{\includegraphics{images/perf.pdf}} - \caption{Performance of adaptive seeding compared to other influence maximization approaches. The influence, on the $y$-axis is displayed in logarithmic scale and is measured by the sum of the degrees of the selected users~\eqref{eq:voter}. The budget is represented on the $x$-axis as a fraction of the seed size.} + +\begin{figure*}[t] + \centerline{\includegraphics[width=0.99\textwidth]{images/perf2.pdf}} + \vspace{-5pt} + \caption{\small{Performance of adaptive seeding compared to other influence + maximization approaches. The horizontal axis represents the budget used as +a fraction of the size of the core set. The vertical axis is the +expected influence reachable by optimally selecting nodes on the second stage.}} \label{fig:performance} + \vspace{-10pt} \end{figure*} -Figure~\ref{fig:performance} compares the performance of \emph{Adaptive seeding}, our own approach, to the afore-mentioned approaches for all the verticals we collected. For this figure, the probability of initial users' friends joining on the second day was set to one. Section~\ref{sec:robustness} explores other influence propagation scenarios. +\subsection{Performance on Facebook Pages} Figure~\ref{fig:performance}
+compares the performance of \emph{adaptive seeding}, our own approach, to the
+aforementioned approaches for all the verticals we collected. In this first
+experiment, we made simplifying assumptions about the parameters of the model.
+The first assumption is that all probabilities in the adaptive seeding model
+are equal to one. This implicitly assumes that every friend of a user who
+followed a certain topic is interested in promoting the topic given a reward.
+Although this is a strong assumption that we will revisit, we note that the
+probabilities can be controlled to some extent by the social networking service
+on which the campaign is being run, for instance by prominently displaying the campaign material
+(sponsored links, fund-raising banners, etc.). The second assumption is that
+the measure of influence is the sum of the degrees of the selected set. This
+measure is an appealing proxy as it is known that in the voter model, after
+polynomially many time steps, the influence of each node is proportional to its
+degree with high probability~\cite{even-dar}. Since the influence process
+cannot be controlled by the designer, the assumption is often that the
+influence process runs until it stabilizes (in linear thresholds and
+independent cascades for example, the process terminates after a linear number
+of steps~\cite{KKT03}). We perform a set of experiments for different time
+horizons in Section~\ref{sec:inf}.
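For concreteness, the sketch below spells out the three benchmarks and the degree-sum influence proxy described above. It is a simplified reading of the definitions in this section rather than the code used for the experiments: the graph `G`, the list `core`, the tie-breaking, and the handling of odd budgets in \textsf{RF} are illustrative assumptions, and the adaptive seeding algorithm itself (Section~\ref{sec:comb}) is not reproduced here.

```python
import random

def influence(G, seeds):
    """Additive influence proxy used in this section: the sum of the
    degrees of the seeded users (voter model, long time horizon)."""
    return sum(G.degree(v) for v in set(seeds))

def bench_rn(G, core, k):
    # Random Node (RN): k users drawn uniformly from the core set.
    return random.sample(core, k)

def bench_im(G, core, k):
    # Influence Maximization (IM) under the voter model with a long
    # horizon: the k highest-degree core users.
    return sorted(core, key=G.degree, reverse=True)[:k]

def bench_rf(G, core, k):
    # Random Friend (RF): k/2 random core users plus one random friend
    # of each, spending the budget of k rewards overall.
    picked = random.sample(core, k // 2)
    friends = [random.choice(list(G.neighbors(u))) for u in picked]
    return picked + friends
```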
+ + -We note that the \emph{Random friend} heuristic significantly outperforms \emph{Max degree}. Using the same budget, the degree gain induced by moving from initial users to their friends is such that selecting at random among the initial users' friends already does better than the best heuristic restricted only on initial users. Using \emph{Adaptive seeding} to optimize the choice of initial users based on their friends' degrees then results in an order of magnitude increase over \emph{Random friend}, consistently for all the verticals. +%One can conside
+%
+%In the next sections we perform tests with different parameters of probabilities and time horizons. We
+%
+%\begin{itemize}
+%\item In the first experiment represented the probability of initial users'
+%friends joining on the second day was set to one, i.e. it assumes that every friend of a user who followed a certain topic is interested in promoting the topic given a reward.
+%In the following sections we analyze cases when the probabilities vary.
+%We note however that this assumption is not unrealistic since the probabilities can be controlled to some extent by the social networking service on which the campaign is being run. By showing prominently the campaign
+%material (sponsored links, fund-raising banners, etc.).
+%%, the conversion rate can be increased beyond what would happen via regular word-of-mouth propagation.
+%
+%\item
+%The measure of influence reported in this experiment, is the sum of the degrees of the selected set.
+%This measure is an appealing proxy due to known results about influence maximization. It is shown in~\cite{ES08} that in the voter model, after polynomially many time steps the influence of each node is proportional to its degree, with high probability. Since the influence process cannot be controlled by the designer, the assumption is often that the influence process runs until it stabilizes (in linear thresholds and independent cascades for example, the process terminates after a linear number of steps~\cite{KKT03}). We perform a set of experiments for different time horizons in the following section.
+%\end{itemize} -We also compare the relative improvements of \emph{Adaptive seeding} over \emph{Max degree} across the different verticals. The results are shown in Figure~\ref{fig:compare}. -\begin{figure}[ht!] - \centerline{ \includegraphics[scale=0.7]{images/comp.pdf} } +%\noindent\textbf{Performance.}
+It is striking to see how well adaptive seeding does in comparison to other
+methods. Even when using a small budget (a 0.1 fraction of the core set, which
+in these cases is about 100 nodes), adaptive seeding improves influence by
+a factor of at least 10, across all verticals. To confirm this, we plot the
+relative improvements of \emph{adaptive seeding} over \textsf{IM} in aggregate
+over the different pages. The results are shown in Figure~\ref{fig:compare}.
+This dramatic improvement is largely due to the friendship paradox phenomenon
+that adaptive seeding leverages. Returning to Figure~\ref{fig:performance}, it
+is also interesting to note that the \textsf{RF} heuristic significantly
+outperforms the standard \textsf{IM} benchmark. Using the same budget, the
+degree gain induced by moving from the core set to its neighborhood is such
+that selecting at random among the core set users' friends already does better
+than the best heuristic restricted to the core set.
Using \emph{adaptive
+seeding} to optimize the choice of core set users based on their friends'
+degrees then results in an order of magnitude increase over \textsf{RF},
+consistently for all the pages.\newline
+
+\begin{figure}[t]
+ \vspace{-10pt}
+ \centerline{ \includegraphics[width=0.4\textwidth]{images/comp2.pdf} } \vspace{-10pt} - \caption{Ratio of the performance of adaptive seeding to max. degree for several verticals.} + \caption{Ratio of the performance of adaptive seeding to \textsf{IM}. Bars represent the mean improvement across all verticals, and the ``error bar'' represents the range of improvement across verticals.} \label{fig:compare} + \vspace{-15pt} \end{figure} -\subsection{Robustness to spread of influence frictions} +\subsection{The Effect of the Probabilistic Model} \label{sec:robustness} -The results presented in Section~\ref{sec:performance} were computed assuming a probability of propagation between the two stages of adaptive seeding equal to one. Estimating this probability is a research problem on its own; however, we note that it can be controlled to some extent by the social networking service on which the campaign is being run. By showing prominently the campaign material (sponsored links, fund-raising banners, etc.), the conversion rate can be increased beyond what would happen via regular word-of-mouth propagation. +The results presented in Section~\ref{sec:performance} were computed assuming
+the probabilities in the adaptive seeding model are one. We now describe
+several experiments we performed with the Facebook Pages data set that test the
+advantages of adaptive seeding under different probability models.\newline
+
+%Estimating this probability is a research problem on its own; however,
+%we note that it can be controlled to some extent by the social networking
+%service on which the campaign is being run. By showing prominently the campaign
+%material (sponsored links, fund-raising banners, etc.), the conversion rate can
+%be increased beyond what would happen via regular word-of-mouth propagation.
+%\newline -Figure~\ref{fig:prob} shows the impact of the probability of propagation between the two stages. For several values of $p$, we computed the performance of \emph{Adaptive seeding} when each friend of a seeded initial user joins during the second stage independently with probability $p$. We see that even with $p=0.01$, \emph{Adaptive seeding} still outperforms \emph{Max degree}. As $p$ increases, the performance of \emph{Adaptive seeding} quickly increases and reaches $80\%$ of the values of Figure~\ref{fig:performance} at $p=0.5$. +\noindent\textbf{Impact of the Bernoulli parameter.}
+Figure~\ref{fig:prob} shows the impact of the probability of nodes realizing in
+the second stage. We computed the performance of \emph{adaptive seeding} when
+each friend of a seeded user in the core set joins during the second stage
+independently with probability $p$, using different values of $p$. We call $p$
+the \emph{Bernoulli} parameter, since the event that a given user joins on the
+second stage of adaptive seeding is governed by a Bernoulli variable of
+parameter $p$. We see that even with $p=0.01$, \emph{adaptive seeding} still
+outperforms \textsf{IM}. As $p$ increases, the performance of \emph{adaptive
+seeding} quickly increases and reaches $80\%$ of the values of
+Figure~\ref{fig:performance} at $p=0.5$.\newline -\begin{figure}[ht!]
+\begin{figure}[t] \begin{subfigure}[t]{0.23\textwidth} - \includegraphics[scale=0.45]{images/prob.pdf} + \includegraphics[scale=0.48]{images/prob.pdf} \vspace{-10pt} \caption{} \label{fig:prob} \end{subfigure} +\hspace{1pt} \begin{subfigure}[t]{0.23\textwidth} - \includegraphics[scale=0.45]{images/hbo_likes.pdf} + \includegraphics[scale=0.48]{images/hbo_likes.pdf} - \vspace{-10pt} \caption{} \label{fig:killer} \end{subfigure} + \vspace{-5pt} - \caption{(a) Performance of Adaptive seeding in logarithmic scale for various propagation probabilities. (b) In logarithmic scale, performance of \emph{Adaptive seeding} when restricted to users who \emph{liked} HBO (\textsf{Adapt. seed. (rest.)}), compared to \emph{Max degree} and the unrestricted \emph{Adaptive seeding}.} + \caption{\small{(a) Performance of adaptive seeding for various propagation
+ probabilities. (b) Performance of \emph{adaptive seeding} when restricted
+ to the subgraph of users who \emph{liked} HBO (red line).}}
+ \vspace{-20pt}
\end{figure} -In practice, the propagation probability will vary among individuals. However, for those who have already expressed interest in the promoted content, we can expect this probability to be close to one. For our next experiment, we chose a vertical (HBO) and trimmed the social graph we collected by only keeping on the second stage users who indicated this vertical (HBO) in their list of interests. Figure~\ref{fig:killer} shows that even on this very restricted set of users, \emph{Adaptive seeding} still outperforms \emph{Max degree} and reaches approximately $50\%$ of the unrestricted adaptive seeding. +\noindent\textbf{Coarse estimation of probabilities.}
+In practice, the probability that a user is interested in promoting a campaign
+her friend is promoting may vary. However, for those who have already expressed
+interest in the promoted content, we can expect this probability to be close to
+one. We therefore conducted the following experiment. We chose a page (HBO)
+and trimmed the social graph we collected by keeping on the second stage only
+users who indicated this page (HBO) in their list of interests. This is
+a coarse estimation of the probabilities, as it assumes that if a friend follows
+HBO she will be willing to promote with probability 1 (given a reward), and
+otherwise the probability of her promoting anything for HBO is 0.
+Figure~\ref{fig:killer} shows that even on this very restricted set of users,
+\emph{adaptive seeding} still outperforms \textsf{IM} and reaches approximately
+$50\%$ of the unrestricted adaptive seeding.\newline -%\item Fix budget; plot the number of people that tweeted coffee between 9am to 10am, between 10am to 11am, between 11am to noon,..., -%\item Argue: if the probability for a friend joining is small, then as long as each person has a few high degree friends, we will get a constant fraction of high degree friends joining, in expectation. -%\item Experiment only on friends who also followed the topic: go to a vertical with a lot of followers (but hopefully number of ``likes'' are under 1000) and perform the adaptive seeding experiment only on people who also follow that vertical. Should roughly be 3x requests as other experiments. -%\end{itemize} +\noindent\textbf{Impact of the probability distribution.} In order to test
+scenarios where users have a rich spectrum of probabilities of realizing on the
+second stage, we consider a setting where the Bernoulli parameter $p$ is drawn
+from a distribution.
We considered four different distributions; for each
+distribution, and for fixed values of the budget and the parameter $p$, we tuned the
+parameters of the distribution so that its mean is exactly $p$. We then plotted
+the performance as a function of the budget and mean $p$.
+
+For the Beta distribution, we fixed $\beta=5$ and tuned the $\alpha$ parameter
+to obtain a mean of $p$, thus obtaining a unimodal distribution. For the normal
+distribution, we chose a standard deviation of $0.01$ to obtain a distribution
+more concentrated around its mean than the Beta distribution. Finally, for the
+inverse degree distribution, we took the probability of a node joining on
+the second stage to be proportional to the inverse of its degree (scaled so that on
+average, nodes join with probability $p$). The results are shown in
+Figure~\ref{fig:bernouilli}.
+
+We observe that the results are comparable to those we obtained in the
+uniform case in Figure~\ref{fig:prob}, except in the case of the inverse degree
+distribution, for which the performance is roughly halved. Remember that the
+value of a user $v$ on the second stage of adaptive seeding is given by $p_v
+d_v$ where $d_v$ is its degree and $p_v$ is its probability of realizing on
+the second stage. Choosing $p_v$ to be proportional to ${1}/{d_v}$ has the
+effect of normalizing the nodes on the second stage and is a strong
+perturbation of the original degree distribution of the nodes available on the
+second stage.
+
+\begin{figure*}[t!]
+\centering
+\begin{subfigure}[b]{0.25\textwidth}
+\includegraphics[width=\textwidth]{images/beta.pdf}
+\caption{Beta distribution}
+\end{subfigure}
+\begin{subfigure}[b]{0.25\textwidth}
+\includegraphics[width=\textwidth]{images/gauss.pdf}
+\caption{Normal Distribution}
+\end{subfigure}
+\begin{subfigure}[b]{0.24\textwidth}
+\includegraphics[width=\textwidth]{images/power.pdf}
+\caption{Power-law distribution}
+\end{subfigure}
+\begin{subfigure}[b]{0.24\textwidth}
+\includegraphics[width=\textwidth]{images/deg.pdf}
+\caption{Inverse degree}
+\end{subfigure}
+\caption{Performance of adaptive seeding as a function of the budget and the
+mean of the distribution from which the Bernoulli parameters are drawn. The
+details of the parameters for each distribution can be found in
+Section~\ref{sec:robustness}.}
+\label{fig:bernouilli}
+\vspace{-10pt}
+\end{figure*}
+
+\subsection{Impact of the Influence Model}
+\label{sec:inf}
+
+The Facebook Pages data set we collected is limited in that we only have access
+to the 2-hop neighborhood around the seed users, and we use the degree of the
+second stage users as a proxy for their influence. As proved in
+\cite{even-dar}, in the voter model, the influence of nodes converges to their
+degree with high probability when the number of time steps becomes polynomially
+large in the network size.
+
+\begin{figure}
+ \centering
+ \includegraphics[width=0.48\textwidth]{images/voter.pdf}
+ \vspace{-20pt}
+ \caption{Performance of adaptive seeding compared to \textsf{IM} for the voter
+ influence model with $t$ steps.}
+ \vspace{-10pt}
+ \label{fig:voter}
+\end{figure}
+
+In order to analyze the expected number of nodes influenced according to the
+voter model that terminates after some fixed number of time steps, we use
+publicly available data sets from \cite{snapnets} where the entire network is
+at our disposal. As discussed above, we sample nodes uniformly at random to
+model the core set.
We then run the voter model for $t$ time steps
+to compute the influence of the second stage users. Figure~\ref{fig:voter}
+shows the performance of adaptive seeding as a function of $t$ compared to the
+performance of the \textsf{IM} benchmark. In this experiment, the budget was
+set to half the size of the core set.
+
+We see that the performance of adaptive seeding quickly converges (5 time steps
+for \textsf{Slashdot}, 15 time steps for \textsf{Epinions}). In practice, the
+voter model converges much faster than the theoretical guarantee of
+\cite{even-dar}, which justifies using the degree of the second stage users as
+a measure of influence, as we did for the Facebook Pages data sets.
+Furthermore, we see that, similarly to the Facebook data sets, adaptive seeding
+significantly outperforms \textsf{IM}.
+
+
+\subsection{Performance on Synthetic Networks}
+
+In order to analyze the impact of topological variations, we generated synthetic
+graphs using standard network models. All the generated graphs have $100,000$
+vertices; for each model, we tuned the generative parameters to obtain, when
+possible, a degree distribution (or otherwise a graph density) similar to what we
+observed in the Facebook Pages data sets.
+
+\begin{itemize}
+ \item \emph{Barabási-Albert:} This well-known model is often used to model
+ social graphs because its degree distribution is a power law. We took
+ 10 initial vertices and added 10 vertices at each step, using the
+ preferential attachment model, until we reached 100,000 vertices.
+ \item \emph{Small-World:} Also known as the Watts-Strogatz model. This
+ model was one of the first models proposed for social networks. Its
+ diameter and clustering coefficient are more representative of a social
+ network than what one would get with the Erdős–Rényi model. We started
+ from a regular lattice of degree 200 and rewired each edge with
+ probability 0.3.
+ \item \emph{Kronecker:} Kronecker graphs were more recently introduced in
+ \cite{kronecker} as a scalable and easy-to-fit model for social
+ networks. We started from a star graph with 4 vertices and computed
+ Kronecker products until we reached 100,000 nodes.
+ \item \emph{Configuration model:} The configuration model allows us to
+ construct a graph with a given degree distribution. We chose a page
+ (GAP) and generated a graph with the same degree distribution using the
+ configuration model.
+\end{itemize}
+The performance of adaptive seeding compared to our benchmarks can be found in
+Figure~\ref{fig:synth}. We note that the improvement obtained by adaptive
+seeding is comparable to the one we had on real data, except for the
+\emph{Small-World} model. This is explained by the nature of the model:
+starting from a regular lattice, some edges are re-wired at random. This model
+has similar properties to a random graph where the friendship paradox does not
+hold~\cite{LS13}. Since adaptive seeding is designed to leverage the friendship
+paradox, such graphs are not amenable to this approach.
+
+\begin{figure}[t]
+ \centering
+ \includegraphics[width=0.48\textwidth]{images/perf_synth.pdf}
+ \vspace{-15pt}
+ \caption{Performance of adaptive seeding on synthetic networks.}
+ \label{fig:synth}
+ \vspace{-10pt}
+\end{figure} \subsection{Scalability}\label{sec:scalability} +To test the scalability of adaptive seeding, we were guided by two central
+questions. First, we were interested in quantifying the benefit our non-sampling
+approach has over the standard SAA method.
Secondly, we wanted to understand
+when one should prefer to use the LP-based approach from Section~\ref{sec:lp}
+over the combinatorial one from Section~\ref{sec:comb}. The computations in
+this section were run on an Intel Core i5 CPU (4x2.40GHz). For each computation, we
+plot the time and number of CPU cycles it took.\newline
+
+\noindent\textbf{Comparison with SAA.} The objective function of the
+non-adaptive problem \eqref{eq:relaxed} is an expectation over exponentially
+many sets, namely all possible realizations of the neighbors in the second stage.
+Following the sampling-based approach introduced in \cite{singer}, this
+expectation can be computed by averaging the values obtained in
+$O\left(n^2\right)$ independent sample realizations of the second stage users
+($n$ is the number of neighbors of core set users). One important aspect of
+the algorithms introduced in this paper is that in the additive case, this
+expectation can be computed exactly without sampling, thus significantly
+improving the theoretical complexity. -Finally, we look at the scalability of our approaches. More specifically, we compare the running time of the LP-based approach and the combinatorial approach for different instance sizes. +In Figure~\ref{fig:sampling}, we compare the running time of our combinatorial
+algorithm to the same algorithm where the expectation is computed via sampling.
+We note that this sampling-based algorithm is still simpler than the algorithm
+introduced in \cite{singer} for general influence models. However, we observe
+a significant gap between its running time and that of the combinatorial
+algorithm. Since each sample takes linear time to compute, this gap is in fact
+$O(n^3)$, quickly leading to impractical running times as the size of the
+graph increases. This highlights the importance of the \emph{sampling-free}
+approach underlying the algorithms we introduced.\newline
+
+\begin{figure}[t]
+ \centerline{ \includegraphics[width=0.48\textwidth]{images/sampling.pdf} }
+ \vspace{-10pt}
+ \caption{Running time and number of CPU cycles used by the sampling-based
+ algorithm and the combinatorial adaptive seeding algorithm for different
+sizes of the core set.}
+ \label{fig:sampling}
+ \vspace{-15pt}
+\end{figure}
+
+\noindent\textbf{Combinatorial vs. LP algorithm.}
+We now compare the running time of the LP-based approach and the combinatorial
+approach for different instance sizes.
+
+Figure~\ref{fig:time} shows the running time and number of CPU cycles used by
+the LP algorithm and the combinatorial algorithm as a function of the network
+size $n$. The different network sizes were obtained by randomly sampling
+a varying fraction of core set users and then trimming the social graph by only
+keeping friends of this random sample on the second stage. The LP solver used
+was CLP~\cite{clp}. -\begin{figure}[h!]
+\begin{figure}[t] \centerline{ \includegraphics[scale=0.9]{images/time.pdf} } \vspace{-10pt} \caption{Running time and number of CPU cycles of the combinatorial algorithm and the LP algorithm as a function of the number of nodes $n$. First row with budget $k=100$, second row with budget $k=500$.} \label{fig:time} + \vspace{-10pt} \end{figure} -We observe that for a \emph{small} value of the budget $k$ ($k=100$, first row of Figure~\ref{fig:time}), the combinatorial algorithm outperforms the LP algorithm. When $k$ becomes \emph{large} ($k=500$, second row of Figure~\ref{fig:time}), the LP algorithm becomes faster. This can be explained by the $k^2$ factor in the running time guarantee of the combinatorial algorithm (see Proposition~\ref{prop:running_time}). Even though the asymptotic $n\log n$ guarantee of the combinatorial algorithm should theoretically outperform the LP-based approach for large $n$, we were not able to observe it for our instance sizes. In practice, one can choose which of the two algorithms to apply depending on the relative sizes of $k$ and $n$. +We observe that for a \emph{small} value of the budget $k$ (first row
+of Figure~\ref{fig:time}), the combinatorial algorithm outperforms the LP
+algorithm. When $k$ becomes \emph{large} (second row of
+Figure~\ref{fig:time}), the LP algorithm becomes faster. This can be explained
+by the $k^2$ factor in the running time of the combinatorial
+algorithm (see Proposition~\ref{prop:running_time}). Even though the asymptotic
+guarantee of the combinatorial algorithm should theoretically
+outperform the LP-based approach for large $n$, we were not able to observe it
+for our instance sizes. In practice, one can choose which of the two algorithms
+to apply depending on the relative sizes of $k$ and $n$.\newline
+
+
+%\begin{figure}
+% \centerline{ \includegraphics[width=0.47\textwidth]{images/sampling.pdf} }
+% \vspace{-10pt}
+% \caption{}
+% \label{fig:sampling}
+%\end{figure} %\noindent \textbf{Others} %\begin{itemize}
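To make the SAA comparison in Section~\ref{sec:scalability} concrete, the sketch below shows the sampling-based estimator that the combinatorial algorithm is benchmarked against: for a fixed set of reachable neighbors, it averages, over independently sampled realizations, the best additive second-stage value, i.e., the sum of the largest degrees among realized nodes. The function and variable names are ours, and the exact sampling-free computation used by our algorithms (Section~\ref{sec:comb}) is not reproduced here.

```python
import random

def sampled_second_stage_value(neighbors, prob, degree, budget, samples=1000):
    """SAA-style estimate of the expected optimal second-stage value:
    each neighbor v realizes independently with probability prob[v] and,
    if selected, contributes degree[v] to the additive influence."""
    total = 0.0
    for _ in range(samples):
        realized = [v for v in neighbors if random.random() < prob[v]]
        best = sorted((degree[v] for v in realized), reverse=True)[:budget]
        total += sum(best)
    return total / samples

# Hypothetical usage, with dictionaries mapping each neighbor to its
# realization probability and degree:
# value = sampled_second_stage_value(neighbors, p, deg, budget=100)
```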
