\documentclass[10pt]{article}
\usepackage{amsmath,amsfonts,amsthm}
\usepackage[english]{babel}
\usepackage{paralist} 
\usepackage[utf8x]{inputenc}
\usepackage[pagebackref=false,breaklinks=true,colorlinks=true,citecolor=blue]{hyperref}
\usepackage[capitalize, noabbrev]{cleveref}
\usepackage[square,sort]{natbib}


% these are compressed lists to help fit into a 1 page limit
\newenvironment{enumerate*}%
  {\vspace{-2ex} \begin{enumerate} %
     \setlength{\itemsep}{-1ex} \setlength{\parsep}{0pt}}%
  {\end{enumerate}}
 
\newenvironment{itemize*}%
  {\vspace{-2ex} \begin{itemize} %
     \setlength{\itemsep}{-1ex} \setlength{\parsep}{0pt}}%
  {\end{itemize}}
 
\newenvironment{description*}%
  {\vspace{-2ex} \begin{description} %
     \setlength{\itemsep}{-1ex} \setlength{\parsep}{0pt}}%
  {\end{description}}

\newcommand\blfootnote[1]{%
  \begingroup
  \renewcommand\thefootnote{}\footnote{#1}%
  \addtocounter{footnote}{-1}%
  \endgroup
}


\DeclareMathOperator*{\E}{\mathbf{E}}
\let\Pr\relax
\DeclareMathOperator*{\Pr}{\mathbf{P}}
\DeclareMathOperator*{\I}{\mathcal{I}}
\DeclareMathOperator*{\TPRev}{\textsc{TPRev}}
\DeclareMathOperator*{\Rev}{\textsc{Rev}}
\DeclareMathOperator*{\Val}{\textsc{Val}}
\DeclareMathOperator*{\BRev}{\textsc{BRev}}
\DeclareMathOperator*{\SRev}{\textsc{SRev}}

\newcommand{\inprod}[1]{\left\langle #1 \right\rangle}
\newcommand{\R}{\mathbb{R}}
\newcommand{\M}{\mathfrak{M}}
\newcommand{\eqdef}{\mathbin{\stackrel{\rm def}{=}}}
\newcommand{\llbracket}{[\![}

\newtheorem{theorem}{Theorem}
\newtheorem{lemma}{Lemma}
\newtheorem{corollary}{Corollary}
\newtheorem*{definition}{Definition}
\newtheorem*{goal}{Goal}

\newenvironment{proofsketch}[1][Proof Sketch]{\proof[#1]\mbox{}\\*}{\endproof}

\author{Thibaut Horel}
\title{Notes on Adaptive Seeding Code}
\date{}
\begin{document}
%include all references

\maketitle

The purpose of this document is to describe the code I wrote for the paper
\cite{horel}. In this paper, we analyzed and evaluated adaptive strategies to
select influential nodes in a social network. Some of the experiments required
testing our algorithms on real social networks. Specifically, I conducted
experiments on the Facebook social graph and the Twitter social graph. Since
this data is not readily available at the scale required by our experiments,
I had to resort to various distributed mining and scraping methods as explained
below. One of the algorithms we evaluated was a distributed algorithm using the
MapReduce framework. The list of technologies used was:
\begin{itemize}
    \item Python as a general scripting and ``glue'' language.
    \item The Twitter Streaming and REST APIs for the publicly available data.
    \item Selenium\footnote{\url{http://www.seleniumhq.org/}} with
    PhantomJS\footnote{\url{http://phantomjs.org/}} for Twitter and Facebook
    scraping: Selenium is a library to remotely control a browser, and
    PhantomJS is a ``headless'' browser (meaning that it does not run in
    graphical mode) based on WebKit. Using Selenium with a fully functional
    web browser instead of plain HTTP requests (with a tool like cURL) is
    crucial for scraping websites like Twitter and Facebook, where much of the
    content is loaded dynamically with JavaScript in response to the client's
    browser actions (such as scrolling); a short sketch of this pattern is
    given right after this list.
    \item
    BeautifulSoup\footnote{\url{http://www.crummy.com/software/BeautifulSoup/}}:
    an HTML parsing library that can use lxml as its underlying parser.
    \item Celery\footnote{\url{http://www.celeryproject.org/}}: a Python
    library providing a distributed task queue, used to distribute work over
    several machines.
    \item Bottle\footnote{\url{http://bottlepy.org/docs/dev/index.html}}:
    a Python library for building lightweight HTTP servers, used for
    communication between the master node and client nodes in situations that
    Celery could not accommodate.
    \item Amazon EC2: the cloud computing platform used to distribute the data
    mining as well as to run our MapReduce algorithm.
    \item Cython\footnote{\url{http://cython.org/}}: a tool to convert
    ``type-annotated'' Python code into both a C library and the associated
    Python wrapping code, to make parts of our algorithms run faster.
\end{itemize}
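
As an illustration of the scraping pattern above, here is a minimal sketch
(not the project's actual code; the URL and CSS selector are hypothetical) of
how Selenium drives PhantomJS to scroll a page so that JavaScript-loaded
content appears before the HTML is handed to BeautifulSoup:
\begin{verbatim}
import time

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.PhantomJS()  # headless WebKit browser
driver.get("https://twitter.com/some_account/followers")  # hypothetical URL

# Scroll a few times so that the page's JavaScript loads additional items.
for _ in range(10):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)  # give the asynchronous requests time to complete

soup = BeautifulSoup(driver.page_source, "lxml")
# "a.username" is a placeholder selector, not the site's actual markup.
usernames = [a.get_text() for a in soup.select("a.username")]
driver.quit()
\end{verbatim}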

In what follows, the names of the files are relative to the code archive
available at \url{http://thibaut.horel.org/fscode.tar.bz2}. As a simple metric,
the entire project contains $\sim$3000 lines of code. Note that this document
only describes the code; we refer the reader to the main paper for more
details on the scientific content.

\section{Twitter Data Collection}

The code for this part is available in the \textsf{twitter} folder and
comprises about 750 lines of code. The data needed was as follows:
\begin{enumerate*}
\item starting from a hand-selected list of popular hashtags, collect the list
of Twitter accounts that posted a tweet with one of these hashtags over a given
period of time; we call these the ``root'' accounts.
\item for each of those accounts, collect their list of followers.
\item for each of the followers, collect their number of followers, that is,
the number of followers of the root accounts' followers.
\end{enumerate*}

Step 1 was done using the Twitter Streaming API; the code is in
\textsf{twitter/stream.py}. Steps 2 and 3 used both Twitter's REST API and
scraping: we had many accounts for which to collect the list of followers and
a short amount of time to conduct our experiments. Selenium and PhantomJS were
used to scrape the small accounts (those with few followers) so as not to
``waste'' REST API requests on them, reserving those requests for the larger
accounts. \textsf{twitter/scraper.py} contains the scraping code, while
\textsf{twitter/api.py} contains the code making requests to the REST API.
\textsf{twitter/dispatcher.py} dispatches the tasks for small and large
accounts to the clients. The clients run the code in \textsf{twitter/main.py},
which among other things contains a lightweight HTTP server to which the tasks
are sent; the tasks are then consumed by either the scraper or the REST API
code, depending on the size of the account being mined.
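
The following sketch illustrates this dispatch logic (it is illustrative only:
the endpoint, threshold, and payload format are assumptions, and the two
helper functions stand in for the code in \textsf{twitter/scraper.py} and
\textsf{twitter/api.py}). It shows a Bottle endpoint of the kind run by the
clients, routing each account either to the scraper or to the REST API
depending on its number of followers:
\begin{verbatim}
from bottle import Bottle, request

app = Bottle()
FOLLOWER_THRESHOLD = 5000  # assumed cut-off between small and large accounts

def scrape_followers(screen_name):
    """Placeholder for the Selenium/PhantomJS code in twitter/scraper.py."""

def api_followers(screen_name):
    """Placeholder for the REST API code in twitter/api.py."""

@app.post("/task")
def handle_task():
    # Assumed payload: {"screen_name": "...", "followers_count": 123}
    account = request.json
    if account["followers_count"] < FOLLOWER_THRESHOLD:
        scrape_followers(account["screen_name"])
    else:
        api_followers(account["screen_name"])
    return {"status": "done"}

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
\end{verbatim}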

\section{Facebook Data Collection}

The data needed from Facebook was as follows:
\begin{enumerate}
    \item starting from a list of posts on a hand-selected list of Facebook
    Pages, collect the list of Facebook users who liked one of these posts.
    \item for each of these users, collect their list of friends.
    \item for each of the friends, collect their number of friends.
\end{enumerate}

The data was collected exclusively by scraping with Selenium, PhantomJS, and
BeautifulSoup. The list of pages to scrape was added to a task queue using
Celery. The code of the dispatcher can be found in \textsf{facebook/run.py}.
The clients running on Amazon EC2 were connected to the task queue to obtain
the list of pages to scrape. The code for the clients is in
\textsf{facebook/client/}. A lightweight HTTP server written with Bottle
allowed the clients to dynamically request login credentials from a rotating
list of accounts.
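
The Celery pattern described above can be sketched as follows (this is
illustrative and not the contents of \textsf{facebook/run.py}; the broker URL
and the task body are assumptions): the dispatcher enqueues one task per page,
and the workers running on EC2 pick them up.
\begin{verbatim}
from celery import Celery

# The broker URL is an assumption; any broker supported by Celery would do.
app = Celery("facebook_tasks", broker="redis://master-node:6379/0")

@app.task
def scrape_page(page_url):
    """Scrape the likers of a page's posts (placeholder body).

    In the real code this would drive Selenium/PhantomJS and parse the
    resulting HTML with BeautifulSoup, as described above.
    """
    return page_url

if __name__ == "__main__":
    # Dispatcher side: enqueue one task per page to scrape.
    for url in ["https://www.facebook.com/SomePage"]:  # hypothetical list
        scrape_page.delay(url)
\end{verbatim}
Workers on each EC2 instance would then be started with Celery's worker
command (e.g., \texttt{celery worker -A facebook\_tasks}) pointing at the same
broker.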

\section{MapReduce algorithm and Data Analysis}

The implementation of the MapReduce algorithm can be found in the \textsf{mr}
folder. The main part of the algorithm was factored out and implemented in
C/Python with Cython for performance reasons.
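
As a toy illustration of what ``type-annotated'' Python looks like (this is
not the paper's algorithm, just a minimal example written in Cython's
pure-Python mode), the type declarations below let Cython generate a typed C
loop while the file remains valid Python:
\begin{verbatim}
import cython

@cython.locals(i=cython.int, total=cython.double)
def weighted_degree_sum(degrees, weights):
    """Toy stand-in for a hot inner loop: sum of degrees[i] * weights[i]."""
    total = 0.0
    for i in range(len(degrees)):
        total += degrees[i] * weights[i]
    return total
\end{verbatim}
Compiling such a file (e.g., with a small \texttt{setup.py} calling
\texttt{cythonize}) produces the C source and the compiled extension module
that the rest of the Python code can import.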

The data analysis code can be found in the \textsf{analysis} folder. This part
of the code only contains ``standard'' data processing and plotting code.

\bibliographystyle{abbrvnat}
\bibliography{main}
\end{document}