\documentclass[10pt]{article}
\usepackage{amsmath,amsfonts,amsthm}
\usepackage[english]{babel}
\usepackage{paralist}
\usepackage[utf8x]{inputenc}
\usepackage[pagebackref=false,breaklinks=true,colorlinks=true,citecolor=blue]{hyperref}
\usepackage[capitalize, noabbrev]{cleveref}
\usepackage[square,sort]{natbib}

% these are compressed lists to help fit into a 1 page limit
\newenvironment{enumerate*}%
  {\vspace{-2ex} \begin{enumerate} %
    \setlength{\itemsep}{-1ex} \setlength{\parsep}{0pt}}%
  {\end{enumerate}}
\newenvironment{itemize*}%
  {\vspace{-2ex} \begin{itemize} %
    \setlength{\itemsep}{-1ex} \setlength{\parsep}{0pt}}%
  {\end{itemize}}
\newenvironment{description*}%
  {\vspace{-2ex} \begin{description} %
    \setlength{\itemsep}{-1ex} \setlength{\parsep}{0pt}}%
  {\end{description}}

\newcommand\blfootnote[1]{%
  \begingroup
  \renewcommand\thefootnote{}\footnote{#1}%
  \addtocounter{footnote}{-1}%
  \endgroup
}

\DeclareMathOperator*{\E}{\mathbf{E}}
\let\Pr\relax
\DeclareMathOperator*{\Pr}{\mathbf{P}}
\DeclareMathOperator*{\I}{\mathcal{I}}
\DeclareMathOperator*{\TPRev}{\textsc{TPRev}}
\DeclareMathOperator*{\Rev}{\textsc{Rev}}
\DeclareMathOperator*{\Val}{\textsc{Val}}
\DeclareMathOperator*{\BRev}{\textsc{BRev}}
\DeclareMathOperator*{\SRev}{\textsc{SRev}}
\newcommand{\inprod}[1]{\left\langle #1 \right\rangle}
\newcommand{\R}{\mathbb{R}}
\newcommand{\M}{\mathfrak{M}}
\newcommand{\eqdef}{\mathbin{\stackrel{\rm def}{=}}}
\newcommand{\llbracket}{[\![}

\newtheorem{theorem}{Theorem}
\newtheorem{lemma}{Lemma}
\newtheorem{corollary}{Corollary}
\newtheorem*{definition}{Definition}
\newtheorem*{goal}{Goal}
\newenvironment{proofsketch}[1][Proof Sketch]{\proof[#1]\mbox{}\\*}{\endproof}

\author{Thibaut Horel}
\title{Notes on Adaptive Seeding Code}
\date{}

\begin{document}
%include all references
\maketitle

The purpose of this document is to describe the code I wrote for the paper \cite{horel}. In that paper, we analyzed and evaluated adaptive strategies to select influential nodes in a social network. Some of the experiments required testing our algorithms on real social networks; specifically, I conducted experiments on the Facebook and Twitter social graphs. Since this data is not readily available at the scale required by our experiments, I had to resort to various distributed mining and scraping methods, as explained below. One of the algorithms we evaluated was a distributed algorithm using the MapReduce framework.

The technologies used were the following:
\begin{itemize}
\item Python as a general scripting and ``glue'' language.
\item The Twitter Streaming and REST APIs for the publicly available data.
\item Selenium\footnote{\url{http://www.seleniumhq.org/}} with PhantomJS\footnote{\url{http://phantomjs.org/}} for Twitter and Facebook scraping. Selenium is a library to remotely control a browser; PhantomJS is a ``headless'' browser (meaning that it does not run in graphical mode) based on WebKit. Using Selenium with a fully functional web browser instead of plain HTTP requests (with a tool like cURL) is crucial for websites like Twitter and Facebook, where much of the content is loaded dynamically with JavaScript in response to the client's browser actions (like scrolling). A minimal scraping sketch is given after this list.
\item BeautifulSoup\footnote{\url{http://www.crummy.com/software/BeautifulSoup/}}: an HTML parsing library, which can use lxml as its underlying parser.
\item Celery\footnote{\url{http://www.celeryproject.org/}}: a Python library providing a distributed task queue, used to spread tasks over several machines (see the sketch after this list).
\item Bottle\footnote{\url{http://bottlepy.org/docs/dev/index.html}}: a Python library to build lightweight HTTP servers, used to allow communication between the master node and the client nodes in situations that Celery could not accommodate.
\item Amazon EC2: the cloud computing platform used both to distribute the data mining and to run our MapReduce algorithm.
\item Cython\footnote{\url{http://cython.org/}}: a tool to convert ``type-annotated'' Python code into a C library together with the associated Python wrapper code, used to make parts of our algorithms run faster (see the sketch after this list).
\end{itemize}
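To make the scraping setup concrete, here is a minimal sketch of how Selenium, PhantomJS and BeautifulSoup are typically combined; the URL and CSS selector are hypothetical placeholders and do not correspond to the actual code in \textsf{twitter/scraper.py} or \textsf{facebook/client/}, and it assumes a Selenium release old enough to still ship the PhantomJS driver.

\begin{verbatim}
import time

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.PhantomJS()           # headless WebKit browser
driver.get('https://example.com/page')   # hypothetical URL
# Scroll to the bottom of the page to trigger the JavaScript that
# dynamically loads more content, then wait for it to arrive.
driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
time.sleep(2)
# Hand the rendered HTML over to BeautifulSoup for parsing.
soup = BeautifulSoup(driver.page_source, 'lxml')
names = [a.get_text() for a in soup.select('a.user-name')]  # hypothetical selector
driver.quit()
\end{verbatim}

The scroll-and-wait step is the reason a real browser is needed: a plain HTTP request would only return the initial page, without the content loaded afterwards by JavaScript.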
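The following sketch illustrates the Celery usage pattern: a task is declared on the workers and the master enqueues one task per unit of work. The broker URL, module name and task body are illustrative assumptions, not the actual contents of \textsf{facebook/run.py}.

\begin{verbatim}
# tasks.py (illustrative)
from celery import Celery

# Hypothetical broker URL; any broker supported by Celery works.
app = Celery('scraping', broker='redis://master-node:6379/0')

@app.task
def scrape_page(page_url):
    """Executed on a worker machine: fetch and parse one page."""
    # ... Selenium + BeautifulSoup scraping code goes here ...
    return page_url

# On the master, the dispatcher enqueues one task per page:
#   for url in pages_to_scrape:
#       scrape_page.delay(url)
\end{verbatim}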
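Finally, as an illustration of the type annotations Cython relies on, here is a small function operating on typed memoryviews; it is a generic example and not the code in the \textsf{mr} folder. Compiling it (e.g.\ with \texttt{cythonize}) produces a C extension module that can be imported like any Python module.

\begin{verbatim}
# core.pyx (illustrative)
# The typed arguments and local variables let Cython compile the loop to C.
def weighted_degree(long[:] neighbors, double[:] weights):
    cdef double total = 0.0
    cdef Py_ssize_t i
    for i in range(neighbors.shape[0]):
        total += weights[neighbors[i]]
    return total
\end{verbatim}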
In what follows, the names of the files are relative to the code archive available at \url{http://thibaut.horel.org/fscode.tar.bz2}. As a simple metric, the entire project contains $\sim$3000 lines of code. Note that this document only describes the code; we refer the reader to the main paper for more details on the scientific content.

\section{Twitter Data Collection}

The code for this part is available in the \textsf{twitter} folder and comprises about 750 lines of code. The data needed was as follows:
\begin{enumerate*}
\item starting from a hand-selected list of popular hashtags, collect the list of Twitter accounts which posted a tweet with one of these hashtags over a given period of time; we call them the ``root'' accounts.
\item for each of these accounts, collect their list of followers.
\item for each of the followers, collect their number of followers, that is, the number of followers of the root accounts' followers.
\end{enumerate*}
Step 1 was done using the Twitter Streaming API; the code is in \textsf{twitter/stream.py}. Steps 2 and 3 used both Twitter's REST API and scraping: we had many accounts for which to collect the list of followers and only a short amount of time to conduct our experiments. Scraping with Selenium and PhantomJS was used for the small accounts (those with few followers), so as not to ``waste'' REST API requests that were better spent on the larger accounts. \textsf{twitter/scraper.py} contains the scraping code, while \textsf{twitter/api.py} contains the code making requests to the REST API. \textsf{twitter/dispatcher.py} dispatches the tasks for small and large accounts to the clients. The clients run the code in \textsf{twitter/main.py}, which among other things contains a lightweight HTTP server to which the tasks are sent; each task is then handled either by the scraper or by the REST API code, depending on the size of the account being mined. Two sketches are given below: one of the streaming step and one of the task-receiving HTTP endpoint.
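As a sketch of step 1, the following uses the \texttt{tweepy} library (pre-4.0 interface) to track a list of hashtags on the Streaming API and record the author of each matching tweet. This is only one way to implement the step and is not a description of \textsf{twitter/stream.py}; the credentials and hashtag are placeholders.

\begin{verbatim}
import json
import tweepy

# Placeholder credentials; the real ones would be read from a config file.
auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_SECRET')

class RootAccountListener(tweepy.StreamListener):
    """Records the author of every tweet matching the tracked hashtags."""
    def on_status(self, status):
        print(json.dumps({'user_id': status.user.id_str,
                          'screen_name': status.user.screen_name}))

    def on_error(self, status_code):
        # Returning False disconnects the stream (e.g. when rate limited).
        return status_code != 420

stream = tweepy.Stream(auth=auth, listener=RootAccountListener())
stream.filter(track=['#placeholder_hashtag'])
\end{verbatim}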
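The dispatcher--client communication can be pictured as follows: each client exposes a small HTTP endpoint (a Bottle server, in line with the list of technologies above) on which the dispatcher posts tasks, and a local queue decouples reception from processing. The route, port and payload shape below are illustrative assumptions, not the actual interface of \textsf{twitter/main.py}.

\begin{verbatim}
import queue
from bottle import Bottle, request

tasks = queue.Queue()
app = Bottle()

@app.post('/task')
def receive_task():
    # The dispatcher POSTs a JSON payload describing one account to mine.
    tasks.put(request.json)
    return {'status': 'queued'}

if __name__ == '__main__':
    # A worker thread would consume `tasks` and route each account either
    # to the scraper or to the REST API code, depending on its size.
    app.run(host='0.0.0.0', port=8080)
\end{verbatim}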
\section{Facebook Data Collection}

The data needed from Facebook was as follows:
\begin{enumerate}
\item starting from a list of posts on a hand-selected list of Facebook Pages, collect the list of Facebook users who liked one of these posts.
\item for each of these users, collect their list of friends.
\item for each of the friends, collect their number of friends.
\end{enumerate}
The data was collected exclusively by scraping with Selenium, PhantomJS and BeautifulSoup. The list of pages to scrape was added to a task queue using Celery; the code of the dispatcher can be found in \textsf{facebook/run.py}. The clients, running on Amazon EC2, were connected to the task queue to obtain the list of pages to scrape; their code is in \textsf{facebook/client/}. A lightweight HTTP server written with Bottle allowed the clients to dynamically request login credentials from a rotating list of accounts.

\section{MapReduce Algorithm and Data Analysis}

The implementation of the MapReduce algorithm can be found in the \textsf{mr} folder. The main part of the algorithm was factored out and implemented in C/Python with Cython for performance reasons. The data analysis code can be found in the \textsf{analysis} folder; this part of the code only contains ``standard'' data processing and plotting code.

\bibliographystyle{abbrvnat}
\bibliography{main}

\end{document}