summaryrefslogtreecommitdiffstats
path: root/paper/sections/experiments.tex
blob: 1a085d13433c7852e34d39890873d921f8435101 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
In this section, we give experimental evidence of the validity of our approach. Specifically, we show that our algorithms obtain significant improvement over standard influence maximization, that these improvements are observed in various scenarios (different verticals, different models of influence propagation), and that our approach is efficient in terms of running-time and scalable to large social networks.

\subsection{Data description}

Our main testbed is the friendship graph of the social network service Facebook~\cite{facebook}. We start by observing that an efficient campaign run on a social network will primarily target users who have already expressed interests in the topic being promoted. In the adaptive seeding terminology, these users constitute the set of initial nodes that can be targeted during the seeding phase.

In order to reproduce this process, we selected ten different verticals (topics) and for each of these verticals, a person, an institution or an entity whose associated Facebook page is regularly used for promotional posts related to this topic. On each of these pages, we selected a recent post (posted no later than January 2014) with approximately 1,000 \emph{likes}. The set of users who liked those posts constitute our initial sets of nodes.

We then crawled the social network of these sets: for each user, we collected her list of friends, and the degrees (number of friends) of these friends. Table~\ref{tab:data} summarizes statistics about the collected data.

\begin{table}[h!]
    \small
    \centering
    \setlength{\tabcolsep}{3pt}
    \begin{tabular}{llrr}
    \toprule
    Vertical & Page & $m$ & $n$ \\%& $S$ & $F$\\
    \midrule
    Charity & Kiva & 978 & 131334 \\%& 134.29 & 1036.26\\
    Travel & Lonely Planet & 753 & 113250 \\%& 150.40 & 898.50\\
    Public Action & LaManifPourTous & 1041 & 97959 \\%& 94.10 & 722.02\\
    Fashion & GAP & 996 & 115524 \\%& 115.99 & 681.98\\
    Events & Coachella & 826 & 102291 \\%& 123.84 & 870.16\\
    Politics & Green Party & 1044 & 83490 \\%& 79.97 & 1053.25\\
    Technology & Google Nexus & 895 & 137995 \\%& 154.19 & 827.84\\
    News & The New York Times & 894 & 156222 \\%& 174.74 & 1033.94 \\
    Consumption & Peet's & 776 & 56268 \\%& 72.51 & 520.47\\
    Entertainment & HBO & 828 & 108699 \\%& 131.28 & 924.09\\
    \bottomrule
\end{tabular}
\caption{Dataset statistics. $m$: number of initial users, $n$: number of friends of the initial users.}
%$S$: avg. degree of an initial user, $F$: avg. degree of a friend of an initial user.}
\label{tab:data}
\end{table}

We note that depending on the privacy settings of the initial users, it was not always possible to access their list of friends. We decided to remove these users since their ability to spread information could not be readily determined. This effect, combined with various errors encountered during the data collection, accounts for an approximate 15\% reduction between the users who liked a post and the number of users in the datasets we used.

\begin{figure}[t!]
    \centering
    \includegraphics[scale=0.7]{images/para.pdf}
    \caption{Comparison of the average degree of the initial users and the average degree of their friends.}
    \label{fig:para}
\end{figure}

Following our discussion in the introduction, we observe that on average, the degrees of initial users is much lower than the degrees of their friends. This is highlighted on Figure~\ref{fig:para} and justifies our approach.

\subsection{Performance of Adaptive Seeding}
\label{sec:performance}

We start by comparing the performance of adaptive seeding to standard influence maximization approaches. The following heuristics were considered:
\begin{itemize}
    \item \emph{Max degree:} selecting as many initial users as the budget permits, in decreasing order of degrees.
    \item \emph{Random:} selecting a random sample of initial users of the size permitted by the budget.
    \item \emph{Random friend:} spending half of the budget to select a random sample of initial users. For each of the selected random users, we then select randomly one of her friend (hence spending the total budget overall).
\end{itemize}

\begin{figure*}[p]
    \centerline{\includegraphics{images/perf.pdf}}
    \caption{Performance of adaptive seeding compared to other influence maximization approaches. The influence, on the $y$-axis is displayed in logarithmic scale and is measured by the sum of the degrees of the selected users~\eqref{eq:voter}. The budget is represented on the $x$-axis as a fraction of the seed size.}
    \label{fig:performance}
\end{figure*}

Figure~\ref{fig:performance} compares the performance of \emph{Adaptive seeding}, our own approach, to the afore-mentioned approaches for all the verticals we collected. For this figure, the probability of initial users' friends joining on the second day was set to one. Section~\ref{sec:robustness} explores other influence propagation scenarios.

We note that the \emph{Random friend} heuristic significantly outperforms \emph{Max degree}. Using the same budget, the degree gain induced by moving from initial users to their friends is such that selecting at random among the initial users' friends already does better than the best heuristic restricted only on initial users. Using \emph{Adaptive seeding} to optimize the choice of initial users based on their friends' degrees then results in an order of magnitude increase over \emph{Random friend}, consistently for all the verticals.

We also compare the relative improvements of \emph{Adaptive seeding} over \emph{Max degree} across the different verticals. The results are shown in Figure~\ref{fig:compare}.
\begin{figure}[ht!]
    \centerline{ \includegraphics[scale=0.7]{images/comp.pdf} }
    \vspace{-10pt}
    \caption{Ratio of the performance of adaptive seeding to max. degree for several verticals.}
    \label{fig:compare}
\end{figure}

\subsection{Robustness to spread of influence frictions}
\label{sec:robustness}

The results presented in Section~\ref{sec:performance} were computed assuming a probability of propagation between the two stages of adaptive seeding equal to one. Estimating this probability is a research problem on its own; however, we note that it can be controlled to some extent by the social networking service on which the campaign is being run. By showing prominently the campaign material (sponsored links, fund-raising banners, etc.), the conversion rate can be increased beyond what would happen via regular word-of-mouth propagation.

Figure~\ref{fig:prob} shows the impact of the probability of propagation between the two stages. For several values of $p$, we computed the performance of \emph{Adaptive seeding} when each friend of a seeded initial user joins during the second stage independently with probability $p$. We see that even with $p=0.01$, \emph{Adaptive seeding} still outperforms \emph{Max degree}. As $p$ increases, the performance of \emph{Adaptive seeding} quickly increases and reaches $80\%$ of the values of Figure~\ref{fig:performance} at $p=0.5$.

\begin{figure}[ht!]
    \begin{subfigure}[t]{0.23\textwidth}
    \includegraphics[scale=0.45]{images/prob.pdf}
    \vspace{-10pt}
    \caption{}
    \label{fig:prob}
\end{subfigure}
\begin{subfigure}[t]{0.23\textwidth}
    \includegraphics[scale=0.45]{images/hbo_likes.pdf}
    \vspace{-10pt}
    \caption{}
    \label{fig:killer}
     \end{subfigure}
     \caption{(a) Performance of Adaptive seeding in logarithmic scale for various propagation probabilities. (b) In logarithmic scale, performance of \emph{Adaptive seeding} when restricted to users who \emph{liked} HBO (\textsf{Adapt. seed. (rest.)}), compared to \emph{Max degree} and the unrestricted \emph{Adaptive seeding}.}
\end{figure}

In practice, the propagation probability will vary among individuals. However, for those who have already expressed interest in the promoted content, we can expect this probability to be close to one. For our next experiment, we chose a vertical (HBO) and trimmed the social graph we collected by only keeping on the second stage users who indicated this vertical (HBO) in their list of interests. Figure~\ref{fig:killer} shows that even on this very restricted set of users, \emph{Adaptive seeding} still outperforms \emph{Max degree} and reaches approximately $50\%$ of the unrestricted adaptive seeding.

%\item Fix budget; plot the number of people that tweeted coffee between 9am to 10am, between 10am to 11am, between 11am to noon,...,
%\item Argue: if the probability for a friend joining is small, then as long as each person has a few high degree friends, we will get a constant fraction of high degree friends joining, in expectation.
%\item Experiment only on friends who also followed the topic: go to a vertical with a lot of followers (but hopefully number of ``likes'' are under 1000) and perform the adaptive seeding experiment only on people who also follow that vertical.  Should roughly be 3x requests as other experiments. 
%\end{itemize}

\subsection{Scalability}\label{sec:scalability}

Finally, we look at the scalability of our approaches. More specifically, we compare the running time of the LP-based approach and the combinatorial approach for different instance sizes.

Figure~\ref{fig:time} shows the running time and number of CPU cycles used by the LP algorithm and the combinatorial algorithm as a function of the network size $n$. The varying size of the network was obtained by randomly sampling a varying fraction of initial users and then trimming the social graph by only keeping friends of this random sample on the second stage. The computations were run on Intel Core i5 CPU 4x2.40Ghz. The LP solver used was CLP~\cite{clp}.

\begin{figure}[h!]
    \centerline{ \includegraphics[scale=0.9]{images/time.pdf} }
    \vspace{-10pt}
    \caption{Running time and number of CPU cycles of the combinatorial algorithm and the LP algorithm as a function of the number of nodes $n$. First row with budget $k=100$, second row with budget $k=500$.}
    \label{fig:time}
\end{figure}

We observe that for a \emph{small} value of the budget $k$ ($k=100$, first row of Figure~\ref{fig:time}), the combinatorial algorithm outperforms the LP algorithm. When $k$ becomes \emph{large} ($k=500$, second row of Figure~\ref{fig:time}), the LP algorithm becomes faster. This can be explained by the $k^2$ factor in the running time guarantee of the combinatorial algorithm (see Proposition~\ref{prop:running_time}). Even though the asymptotic $n\log n$ guarantee of the combinatorial algorithm should theoretically outperform the LP-based approach for large $n$, we were not able to observe it for our instance sizes. In practice, one can choose which of the two algorithms to apply depending on the relative sizes of $k$ and $n$.

%\noindent \textbf{Others}
%\begin{itemize}
%\item One plot with Twitter data on the same verticals/ same posts.
%\item Run pipeline on available data sets, with random samples;
%\item It's true that adaptive seeding is looking at more people, but so what?  If you waited a day, it would still be better: take a vertical, split it at random to two days.  Perform adaptive seeding on one day, and compare to influence max. on two days.
%\item compare to other influence maximization algorithms?
%\item test on several public datasets
%\item LP vs combinatorial algorithm vs sampling running time
%\end{itemize}