Diffstat (limited to 'note/main.tex')
-rw-r--r--  note/main.tex  165
1 files changed, 165 insertions, 0 deletions
diff --git a/note/main.tex b/note/main.tex
new file mode 100644
index 0000000..336aadf
--- /dev/null
+++ b/note/main.tex
@@ -0,0 +1,165 @@
\documentclass[10pt]{article}
\usepackage{amsmath,amsfonts,amsthm}
\usepackage[english]{babel}
\usepackage{paralist}
\usepackage[utf8x]{inputenc}
\usepackage[pagebackref=false,breaklinks=true,colorlinks=true,citecolor=blue]{hyperref}
\usepackage[capitalize, noabbrev]{cleveref}
\usepackage[square,sort]{natbib}

% these are compressed lists to help fit into a 1 page limit
\newenvironment{enumerate*}%
  {\vspace{-2ex} \begin{enumerate}%
   \setlength{\itemsep}{-1ex} \setlength{\parsep}{0pt}}%
  {\end{enumerate}}

\newenvironment{itemize*}%
  {\vspace{-2ex} \begin{itemize}%
   \setlength{\itemsep}{-1ex} \setlength{\parsep}{0pt}}%
  {\end{itemize}}

\newenvironment{description*}%
  {\vspace{-2ex} \begin{description}%
   \setlength{\itemsep}{-1ex} \setlength{\parsep}{0pt}}%
  {\end{description}}

\newcommand\blfootnote[1]{%
  \begingroup
  \renewcommand\thefootnote{}\footnote{#1}%
  \addtocounter{footnote}{-1}%
  \endgroup
}

\DeclareMathOperator*{\E}{\mathbf{E}}
\let\Pr\relax
\DeclareMathOperator*{\Pr}{\mathbf{P}}
\DeclareMathOperator*{\I}{\mathcal{I}}
\DeclareMathOperator*{\TPRev}{\textsc{TPRev}}
\DeclareMathOperator*{\Rev}{\textsc{Rev}}
\DeclareMathOperator*{\Val}{\textsc{Val}}
\DeclareMathOperator*{\BRev}{\textsc{BRev}}
\DeclareMathOperator*{\SRev}{\textsc{SRev}}

\newcommand{\inprod}[1]{\left\langle #1 \right\rangle}
\newcommand{\R}{\mathbb{R}}
\newcommand{\M}{\mathfrak{M}}
\newcommand{\eqdef}{\mathbin{\stackrel{\rm def}{=}}}
\newcommand{\llbracket}{[\![}

\newtheorem{theorem}{Theorem}
\newtheorem{lemma}{Lemma}
\newtheorem{corollary}{Corollary}
\newtheorem*{definition}{Definition}
\newtheorem*{goal}{Goal}

\newenvironment{proofsketch}[1][Proof Sketch]{\proof[#1]\mbox{}\\*}{\endproof}

\author{Thibaut Horel}
\title{Notes on Adaptive Seeding Code}
\date{}
\begin{document}
%include all references

\maketitle

The purpose of this document is to describe the code I wrote for the paper
\cite{horel}. In this paper, we analyzed and evaluated adaptive strategies for
selecting influential nodes in a social network. Some of the experiments
required testing our algorithms on real social networks. Specifically,
I conducted experiments on the Facebook social graph and the Twitter social
graph. Since this data is not readily available at the scale required by our
experiments, I had to resort to various distributed mining and scraping
methods, as explained below. One of the algorithms we evaluated was
a distributed algorithm using the MapReduce framework. The list of
technologies used was:
\begin{itemize}
  \item Python as a general scripting and ``glue'' language.
  \item The Twitter Streaming and REST APIs for the publicly available data.
  \item Selenium\footnote{\url{http://www.seleniumhq.org/}} with
  PhantomJS\footnote{\url{http://phantomjs.org/}} for Twitter and Facebook
  scraping. Selenium is a library to remotely control a browser; PhantomJS is
  a ``headless'' browser (meaning that it runs without a graphical interface)
  based on WebKit. Using Selenium with a fully functional web browser instead
  of simple HTTP requests (with a tool like cURL) is crucial for websites like
  Twitter and Facebook, where much of the content is loaded dynamically with
  JavaScript in response to the client's browser actions (such as scrolling);
  a sketch of this scraping pattern is given in the Facebook data collection
  section below.
  \item
  BeautifulSoup\footnote{\url{http://www.crummy.com/software/BeautifulSoup/}}:
  an HTML parsing library which can use lxml as its underlying parser.
  \item Celery\footnote{\url{http://www.celeryproject.org/}}: a Python library
  which provides a distributed task queue to distribute tasks over several
  machines.
  \item Bottle\footnote{\url{http://bottlepy.org/docs/dev/index.html}}:
  a Python library to build lightweight HTTP servers, used for communication
  between the master node and the client nodes in situations that Celery could
  not accommodate.
  \item Amazon EC2: the cloud computing platform used to distribute the data
  mining as well as to run our MapReduce algorithm.
  \item Cython\footnote{\url{http://cython.org/}}: a tool to convert
  ``type-annotated'' Python code into both a C library and the associated
  Python wrapping code, used to make parts of our algorithms run faster.
\end{itemize}

In what follows, the names of the files are relative to the code archive
available at \url{http://thibaut.horel.org/fscode.tar.bz2}. As a simple
metric, the entire project contains $\sim$3000 lines of code. Note that this
document only describes the code; we refer the reader to the main paper for
more details on the scientific content.

\section{Twitter Data Collection}

The code for this part is available in the \textsf{twitter} folder and
comprises about 750 lines of code. The data needed was as follows:
\begin{enumerate*}
\item starting from a hand-selected list of popular hashtags, collect the list
of Twitter accounts which posted a tweet with one of these hashtags over
a given period of time; we call them the ``root'' accounts.
\item for each of these accounts, collect their list of followers.
\item for each of the followers, collect their number of followers, that is,
the number of followers of the root accounts' followers.
\end{enumerate*}

Step 1 was done using the Twitter Streaming API; the code is in
\textsf{twitter/stream.py}. Steps 2 and 3 used both Twitter's REST API and
scraping: we had many accounts for which to collect the list of followers and
a short amount of time to conduct our experiments. Scraping with Selenium and
PhantomJS was used for the small accounts (those with few followers) so as not
to ``waste'' requests to the REST API on larger accounts.
\textsf{twitter/scraper.py} contains the scraping code, while
\textsf{twitter/api.py} contains the code making requests to the REST API.
\textsf{twitter/dispatcher.py} dispatches the tasks for small and large
accounts to the clients. The clients run the code in \textsf{twitter/main.py},
which among other things contains a lightweight HTTP server to which the tasks
are sent; the tasks are then consumed by either the scraping code or the REST
API code, depending on the size of the account being mined.

\section{Facebook Data Collection}

The data needed from Facebook was as follows:
\begin{enumerate}
  \item starting from a list of posts on a hand-selected list of Facebook
  Pages, collect the list of Facebook users who liked one of these posts.
  \item for each of these users, collect their list of friends.
  \item for each of the friends, collect their number of friends.
\end{enumerate}

The data was exclusively collected by scraping with Selenium, PhantomJS and
BeautifulSoup. The list of pages to scrape was added to a task queue using
Celery. The code of the dispatcher can be found in \textsf{facebook/run.py}.
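
As an illustration of the scraping pattern used for both Twitter and Facebook,
the following is a minimal sketch (it is not code from the archive; the URL,
CSS selector and timings are hypothetical): the browser is scrolled repeatedly
so that JavaScript loads additional content, and the fully rendered page is
then handed to BeautifulSoup.

\begin{verbatim}
import time

from bs4 import BeautifulSoup
from selenium import webdriver

def scrape_page(url, scrolls=10):
    # PhantomJS renders the page like a regular browser,
    # but without a graphical interface.
    driver = webdriver.PhantomJS()
    driver.get(url)
    # Scroll to the bottom repeatedly so that the page's
    # JavaScript loads additional content.
    for _ in range(scrolls):
        driver.execute_script(
            "window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(1)  # wait for the new content to load
    html = driver.page_source
    driver.quit()
    # Parse the rendered HTML; the CSS class is hypothetical.
    soup = BeautifulSoup(html, "lxml")
    return [a.get_text() for a in soup.select("a.profile-link")]
\end{verbatim}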
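
The dispatch itself follows the standard Celery pattern, sketched below (this
is not the contents of \textsf{facebook/run.py}; the broker URL, file name and
task name are made up). The dispatcher enqueues one task per page, and each
client consumes tasks from the queue and runs the scraping code above.

\begin{verbatim}
from celery import Celery

# The broker URL is hypothetical; any broker
# supported by Celery works.
app = Celery("scraper",
             broker="redis://master.example.com:6379/0")

@app.task
def scrape_page_task(page_url):
    # Runs on a client: scrape the page with
    # Selenium/PhantomJS and store the result.
    pass

if __name__ == "__main__":
    # Dispatcher: enqueue one task per page to scrape.
    with open("pages.txt") as pages:
        for url in pages:
            scrape_page_task.delay(url.strip())
\end{verbatim}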

The clients running on Amazon EC2 were connected to the task queue to obtain
the list of pages to scrape. The code for the clients is in
\textsf{facebook/client/}. A lightweight HTTP server written with Bottle
allowed the clients to dynamically request login credentials from a rotating
list of accounts.

\section{MapReduce Algorithm and Data Analysis}

The implementation of the MapReduce algorithm can be found in the \textsf{mr}
folder. The main part of the algorithm was factored out and implemented in
C/Python with Cython for performance reasons.

The data analysis code can be found in the \textsf{analysis} folder. This part
of the code only contains ``standard'' data processing and plotting code.

\bibliographystyle{abbrvnat}
\bibliography{main}
\end{document}
