author    Thibaut Horel <thibaut.horel@gmail.com> 2015-09-24 02:06:42 +0200
committer Thibaut Horel <thibaut.horel@gmail.com> 2015-09-24 02:06:42 +0200
commit    2e5dc7e20f8aa3148f1ecd5098e24035bbfc9448 (patch)
tree      cd5834001fd5233cf0dad5b1385561b72739e279 /note
parent    132c57058e87d92e34e072ba7e450686a87932a3 (diff)
download  fast-seeding-master.tar.gz

Add code comments (for study plan)
Diffstat (limited to 'note')
-rw-r--r-- note/main.tex 165
1 file changed, 165 insertions, 0 deletions
diff --git a/note/main.tex b/note/main.tex
new file mode 100644
index 0000000..336aadf
--- /dev/null
+++ b/note/main.tex
@@ -0,0 +1,165 @@
+\documentclass[10pt]{article}
+\usepackage{amsmath,amsfonts,amsthm}
+\usepackage[english]{babel}
+\usepackage{paralist}
+\usepackage[utf8x]{inputenc}
+\usepackage[pagebackref=false,breaklinks=true,colorlinks=true,citecolor=blue]{hyperref}
+\usepackage[capitalize, noabbrev]{cleveref}
+\usepackage[square,sort]{natbib}
+
+
+% these are compressed lists to help fit into a 1 page limit
+\newenvironment{enumerate*}%
+ {\vspace{-2ex} \begin{enumerate} %
+ \setlength{\itemsep}{-1ex} \setlength{\parsep}{0pt}}%
+ {\end{enumerate}}
+
+\newenvironment{itemize*}%
+ {\vspace{-2ex} \begin{itemize} %
+ \setlength{\itemsep}{-1ex} \setlength{\parsep}{0pt}}%
+ {\end{itemize}}
+
+\newenvironment{description*}%
+ {\vspace{-2ex} \begin{description} %
+ \setlength{\itemsep}{-1ex} \setlength{\parsep}{0pt}}%
+ {\end{description}}
+
+\newcommand\blfootnote[1]{%
+ \begingroup
+ \renewcommand\thefootnote{}\footnote{#1}%
+ \addtocounter{footnote}{-1}%
+ \endgroup
+}
+
+
+\DeclareMathOperator*{\E}{\mathbf{E}}
+\let\Pr\relax
+\DeclareMathOperator*{\Pr}{\mathbf{P}}
+\DeclareMathOperator*{\I}{\mathcal{I}}
+\DeclareMathOperator*{\TPRev}{\textsc{TPRev}}
+\DeclareMathOperator*{\Rev}{\textsc{Rev}}
+\DeclareMathOperator*{\Val}{\textsc{Val}}
+\DeclareMathOperator*{\BRev}{\textsc{BRev}}
+\DeclareMathOperator*{\SRev}{\textsc{SRev}}
+
+\newcommand{\inprod}[1]{\left\langle #1 \right\rangle}
+\newcommand{\R}{\mathbb{R}}
+\newcommand{\M}{\mathfrak{M}}
+\newcommand{\eqdef}{\mathbin{\stackrel{\rm def}{=}}}
+\newcommand{\llbracket}{[\![}
+
+\newtheorem{theorem}{Theorem}
+\newtheorem{lemma}{Lemma}
+\newtheorem{corollary}{Corollary}
+\newtheorem*{definition}{Definition}
+\newtheorem*{goal}{Goal}
+
+\newenvironment{proofsketch}[1][Proof Sketch]{\proof[#1]\mbox{}\\*}{\endproof}
+
+\author{Thibaut Horel}
+\title{Notes on Adaptive Seeding Code}
+\date{}
+\begin{document}
+%include all references
+
+\maketitle
+
+The purpose of this document is to describe the code I wrote for the paper
+\cite{horel}. In this paper, we analyzed and evaluated adaptive strategies to
+select influential nodes in a social network. Some of the experiments required
+testing our algorithms on real social networks. Specifically, I conducted
+experiments on the Facebook social graph and the Twitter social graph. Since
+this data is not readily available at the scale required by our experiments,
+I had to resort to various distributed mining and scraping methods as explained
+below. One of the algorithms we evaluated was a distributed algorithm using the
+MapReduce framework. The technologies used were:
+\begin{itemize}
+  \item Python as a general scripting and ``glue'' language.
+  \item Twitter's Streaming and REST APIs for the publicly available data.
+  \item Selenium\footnote{\url{http://www.seleniumhq.org/}} with
+    PhantomJS\footnote{\url{http://phantomjs.org/}} for Twitter and Facebook
+    scraping: Selenium is a library to remotely control a browser. PhantomJS
+    is a ``headless'' browser (meaning that it runs without a graphical
+    interface) based on WebKit. Using Selenium with a fully functional web
+    browser instead of plain HTTP requests (with a tool like cURL) is crucial
+    for websites like Twitter and Facebook, where much of the content is
+    loaded dynamically with JavaScript in response to the client's browser
+    actions (like scrolling); a minimal sketch of this setup is given after
+    this list.
+  \item
+    BeautifulSoup\footnote{\url{http://www.crummy.com/software/BeautifulSoup/}}:
+    an HTML parsing library which can be backed by fast parsers such as lxml.
+  \item Celery\footnote{\url{http://www.celeryproject.org/}}: a Python
+    library providing a distributed task queue to spread tasks over several
+    machines.
+  \item Bottle\footnote{\url{http://bottlepy.org/docs/dev/index.html}}:
+    a Python library to build lightweight HTTP servers, used for
+    communication between the master node and the client nodes in situations
+    that Celery could not accommodate.
+  \item Amazon EC2: the cloud computing platform used to distribute the data
+    mining as well as to run our MapReduce algorithm.
+  \item Cython\footnote{\url{http://cython.org/}}: a tool to convert
+    ``type-annotated'' Python code into both a C library and the associated
+    Python wrapping code, to make parts of our algorithms run faster.
+\end{itemize}
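+
+As an illustration of the scraping setup above, here is a minimal sketch; the
+URL, the parser choice and the scrolling logic are illustrative assumptions,
+and the actual scrapers in \textsf{twitter/scraper.py} and
+\textsf{facebook/client/} are more involved.
+
+\begin{verbatim}
+# Minimal Selenium + PhantomJS scraping loop (illustrative only).
+import time
+from selenium import webdriver
+from bs4 import BeautifulSoup
+
+driver = webdriver.PhantomJS()  # headless WebKit browser
+driver.get("https://twitter.com/some_account/followers")
+
+# Scroll to the bottom repeatedly so that the JavaScript on the
+# page loads more content, until the page height stops growing.
+previous_height = 0
+while True:
+    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
+    time.sleep(1)  # give the dynamic content time to load
+    height = driver.execute_script("return document.body.scrollHeight;")
+    if height == previous_height:
+        break  # no new content was loaded
+    previous_height = height
+
+# Hand the fully rendered page over to BeautifulSoup for parsing.
+soup = BeautifulSoup(driver.page_source, "lxml")
+driver.quit()
+\end{verbatim}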
+
+In what follows, the names of the files are relative to the code archive
+available at \url{http://thibaut.horel.org/fscode.tar.bz2}. As a simple metric,
+the entire project contains $\sim$3000 lines of code. Note that this document
+only describes the code; we refer the reader to the main paper for more
+details on the scientific content.
+
+\section{Twitter Data Collection}
+
+The code for this part is available in the \textsf{twitter} folder and
+comprises about 750 lines of code. The data needed was as follows:
+\begin{enumerate*}
+\item starting from a hand-selected list of popular hashtags, collect the list
+of Twitter accounts which posted a tweet with one of these hashtags over a
+given period of time; we call these the ``root'' accounts.
+\item for each of those accounts, collect their list of followers.
+\item for each of the followers, collect their number of followers, that is,
+the number of followers of the root accounts' followers.
+\end{enumerate*}
+
+Step 1. was done using the Twitter Streaming API; the code is in
+\textsf{twitter/stream.py}.
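+
+For concreteness, a minimal sketch of such a streaming collector is given
+below; it uses the tweepy client, which is an assumption (the actual code in
+\textsf{twitter/stream.py} may talk to the Streaming API differently), and
+the credentials and hashtags are placeholders.
+
+\begin{verbatim}
+# Sketch of a Streaming API collector for "root" accounts
+# (illustrative; tweepy is an assumed client library).
+import tweepy
+
+CONSUMER_KEY, CONSUMER_SECRET = "...", "..."
+ACCESS_TOKEN, ACCESS_SECRET = "...", "..."
+
+class HashtagListener(tweepy.StreamListener):
+    def on_status(self, status):
+        # The poster of a tweet with a tracked hashtag is a
+        # "root" account.
+        print(status.user.screen_name)
+
+auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
+auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
+stream = tweepy.Stream(auth, HashtagListener())
+stream.filter(track=["#hashtag1", "#hashtag2"])
+\end{verbatim}
+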
+Steps 2. and 3. used both Twitter's REST API and scraping: indeed, we had
+many accounts for which to collect the list of followers and a short amount
+of time to conduct our experiments. Selenium and PhantomJS were used to
+scrape the small accounts (those with few followers), so that requests to
+the REST API were not ``wasted'' and could be saved for the larger accounts.
+\textsf{twitter/scraper.py} contains the scraping code, while
+\textsf{twitter/api.py} contains the code making requests to the REST API.
+\textsf{twitter/dispatcher.py} dispatches the tasks for small and large
+accounts to the clients. The clients run the code in \textsf{twitter/main.py},
+which among other things contains a lightweight HTTP server to which the tasks
+are sent; each task is then consumed either by the scraper or by the REST API
+client, depending on the size of the account being mined.
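+
+The following sketch shows what the task-receiving end of this pipeline might
+look like with Bottle. The route name, payload fields, threshold and helper
+functions are illustrative assumptions, not the actual values and names from
+\textsf{twitter/main.py}.
+
+\begin{verbatim}
+# Sketch of the client-side task server (illustrative).
+from bottle import Bottle, request
+
+app = Bottle()
+SCRAPING_THRESHOLD = 10000  # assumed small/large account cut-off
+
+def scrape_followers(account):
+    pass  # hypothetical helper: Selenium/PhantomJS path
+
+def api_followers(account):
+    pass  # hypothetical helper: REST API path
+
+@app.post("/task")
+def receive_task():
+    # The dispatcher posts one JSON-encoded task per account.
+    account = request.json["account"]
+    if request.json["followers_count"] < SCRAPING_THRESHOLD:
+        scrape_followers(account)
+    else:
+        api_followers(account)
+    return {"status": "ok"}
+
+app.run(host="0.0.0.0", port=8080)
+\end{verbatim}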
+
+\section{Facebook Data Collection}
+
+The data needed from Facebook was as follows:
+\begin{enumerate}
+ \item starting from a list of posts on a hand-selected list of Facebook
+ Pages, collect the list of Facebook users who liked one of these posts.
+ \item for each of these users, collect their list of friends.
+ \item for each of the friends, collect their number of friends.
+\end{enumerate}
+
+The data was collected exclusively by scraping with Selenium, PhantomJS and
+BeautifulSoup. The list of pages to scrape was added to a task queue using
+Celery. The code of the dispatcher can be found in \textsf{facebook/run.py}.
+The clients running on Amazon EC2 were connected to the task queue to obtain
+the list of pages to scrape. The code for the clients is in
+\textsf{facebook/client/}. A lightweight HTTP server written with Bottle
+allowed the clients to dynamically request login credentials from a rotating
+list of accounts.
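+
+A minimal sketch of this wiring is given below; the broker URL, task body and
+variable names are assumptions, not the actual configuration of
+\textsf{facebook/run.py}.
+
+\begin{verbatim}
+# Sketch of the dispatcher/worker wiring with Celery (illustrative).
+from celery import Celery
+
+app = Celery("scraper", broker="redis://master-node:6379/0")
+
+@app.task
+def scrape_page(page_url):
+    # Each EC2 client runs a Celery worker that pulls page URLs
+    # from the queue and scrapes them with Selenium/PhantomJS.
+    pass
+
+# Dispatcher side: enqueue one task per page to scrape.
+pages_to_scrape = ["https://www.facebook.com/somepage"]  # placeholder
+for url in pages_to_scrape:
+    scrape_page.delay(url)
+\end{verbatim}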
+
+\section{MapReduce algorithm and Data Analysis}
+
+The implementation of the MapReduce algorithm can be found in the \textsf{mr}
+folder. The main part of the algorithm was factored out and implemented in
+C/Python with Cython for performance reasons.
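+
+As a toy illustration of this kind of Cython usage (the kernel below is
+hypothetical; the actual code in the \textsf{mr} folder is different), type
+annotations let Cython compile an inner loop down to plain C:
+
+\begin{verbatim}
+# toy_kernel.pyx -- illustrative, not the real code. Typed
+# memoryviews and cdef declarations remove the Python interpreter
+# overhead from the inner loop.
+def weighted_degree(double[:] weights, long[:] neighbors):
+    cdef double total = 0.0
+    cdef Py_ssize_t i
+    for i in range(neighbors.shape[0]):
+        total += weights[neighbors[i]]
+    return total
+\end{verbatim}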
+
+The data analysis code can be found in the \textsf{analysis} folder. This part
+of the code only contains ``standard'' data processing and plotting code.
+
+\bibliographystyle{abbrvnat}
+\bibliography{main}
+\end{document}