Diffstat (limited to 'note/main.tex')
-rw-r--r--  note/main.tex  165
1 files changed, 165 insertions, 0 deletions
diff --git a/note/main.tex b/note/main.tex
new file mode 100644
index 0000000..336aadf
--- /dev/null
+++ b/note/main.tex
@@ -0,0 +1,165 @@
\documentclass[10pt]{article}
\usepackage{amsmath,amsfonts,amsthm}
\usepackage[english]{babel}
\usepackage{paralist}
\usepackage[utf8x]{inputenc}
\usepackage[pagebackref=false,breaklinks=true,colorlinks=true,citecolor=blue]{hyperref}
\usepackage[capitalize, noabbrev]{cleveref}
\usepackage[square,sort]{natbib}

% these are compressed lists to help fit into a 1 page limit
\newenvironment{enumerate*}%
  {\vspace{-2ex} \begin{enumerate}%
   \setlength{\itemsep}{-1ex} \setlength{\parsep}{0pt}}%
  {\end{enumerate}}

\newenvironment{itemize*}%
  {\vspace{-2ex} \begin{itemize}%
   \setlength{\itemsep}{-1ex} \setlength{\parsep}{0pt}}%
  {\end{itemize}}

\newenvironment{description*}%
  {\vspace{-2ex} \begin{description}%
   \setlength{\itemsep}{-1ex} \setlength{\parsep}{0pt}}%
  {\end{description}}

\newcommand\blfootnote[1]{%
  \begingroup
  \renewcommand\thefootnote{}\footnote{#1}%
  \addtocounter{footnote}{-1}%
  \endgroup
}

\DeclareMathOperator*{\E}{\mathbf{E}}
\let\Pr\relax
\DeclareMathOperator*{\Pr}{\mathbf{P}}
\DeclareMathOperator*{\I}{\mathcal{I}}
\DeclareMathOperator*{\TPRev}{\textsc{TPRev}}
\DeclareMathOperator*{\Rev}{\textsc{Rev}}
\DeclareMathOperator*{\Val}{\textsc{Val}}
\DeclareMathOperator*{\BRev}{\textsc{BRev}}
\DeclareMathOperator*{\SRev}{\textsc{SRev}}

\newcommand{\inprod}[1]{\left\langle #1 \right\rangle}
\newcommand{\R}{\mathbb{R}}
\newcommand{\M}{\mathfrak{M}}
\newcommand{\eqdef}{\mathbin{\stackrel{\rm def}{=}}}
\newcommand{\llbracket}{[\![}

\newtheorem{theorem}{Theorem}
\newtheorem{lemma}{Lemma}
\newtheorem{corollary}{Corollary}
\newtheorem*{definition}{Definition}
\newtheorem*{goal}{Goal}

\newenvironment{proofsketch}[1][Proof Sketch]{\proof[#1]\mbox{}\\*}{\endproof}

\author{Thibaut Horel}
\title{Notes on Adaptive Seeding Code}
\date{}
\begin{document}
%include all references

\maketitle

The purpose of this document is to describe the code I wrote for the paper
\cite{horel}. In this paper, we analyzed and evaluated adaptive strategies for
selecting influential nodes in a social network. Some of the experiments
required testing our algorithms on real social networks. Specifically,
I conducted experiments on the Facebook social graph and the Twitter social
graph. Since this data is not readily available at the scale required by our
experiments, I had to resort to various distributed mining and scraping
methods, as explained below. One of the algorithms we evaluated was
a distributed algorithm using the MapReduce framework. The list of
technologies used was:
\begin{itemize}
  \item Python as a general scripting and ``glue'' language.
  \item The Twitter Streaming and REST APIs for the publicly available data.
  \item Selenium\footnote{\url{http://www.seleniumhq.org/}} with
  PhantomJS\footnote{\url{http://phantomjs.org/}} for Twitter and Facebook
  scraping. Selenium is a library to remotely control a browser; PhantomJS is
  a ``headless'' browser (meaning that it runs without a graphical interface)
  based on WebKit. Using Selenium with a fully functional web browser instead
  of simple HTTP requests (with a tool like cURL) is crucial for websites like
  Twitter and Facebook, where much of the content is loaded dynamically with
  JavaScript in response to the client's browser actions (such as scrolling);
  a sketch of this scraping pattern is given in the Facebook data collection
  section below.
  \item
  BeautifulSoup\footnote{\url{http://www.crummy.com/software/BeautifulSoup/}}:
  an HTML parsing library which can use lxml as its underlying parser.
  \item Celery\footnote{\url{http://www.celeryproject.org/}}: a Python library
  which provides a distributed task queue to distribute tasks over several
  machines.
  \item Bottle\footnote{\url{http://bottlepy.org/docs/dev/index.html}}:
  a Python library to build lightweight HTTP servers, used for communication
  between the master node and the client nodes in situations that Celery could
  not accommodate.
  \item Amazon EC2: the cloud computing platform used to distribute the data
  mining as well as to run our MapReduce algorithm.
  \item Cython\footnote{\url{http://cython.org/}}: a tool to convert
  ``type-annotated'' Python code into both a C library and the associated
  Python wrapping code, used to make parts of our algorithms run faster.
\end{itemize}

In what follows, the names of the files are relative to the code archive
available at \url{http://thibaut.horel.org/fscode.tar.bz2}. As a simple
metric, the entire project contains $\sim$3000 lines of code. Note that this
document only describes the code; we refer the reader to the main paper for
more details on the scientific content.

\section{Twitter Data Collection}

The code for this part is available in the \textsf{twitter} folder and
comprises about 750 lines of code. The data needed was as follows:
\begin{enumerate*}
\item starting from a hand-selected list of popular hashtags, collect the list
of Twitter accounts which posted a tweet with one of these hashtags over
a given period of time; we call them the ``root'' accounts.
\item for each of these accounts, collect their list of followers.
\item for each of the followers, collect their number of followers, that is,
the number of followers of the root accounts' followers.
\end{enumerate*}

Step 1 was done using the Twitter Streaming API; the code is in
\textsf{twitter/stream.py}. Steps 2 and 3 used both Twitter's REST API and
scraping: we had many accounts for which to collect the list of followers and
a short amount of time to conduct our experiments. Scraping with Selenium and
PhantomJS was used for the small accounts (those with few followers) so as not
to ``waste'' requests to the REST API on larger accounts.
\textsf{twitter/scraper.py} contains the scraping code, while
\textsf{twitter/api.py} contains the code making requests to the REST API.
\textsf{twitter/dispatcher.py} dispatches the tasks for small and large
accounts to the clients. The clients run the code in \textsf{twitter/main.py},
which among other things contains a lightweight HTTP server to which the tasks
are sent; the tasks are then consumed by either the scraping code or the REST
API code, depending on the size of the account being mined.

\section{Facebook Data Collection}

The data needed from Facebook was as follows:
\begin{enumerate}
  \item starting from a list of posts on a hand-selected list of Facebook
  Pages, collect the list of Facebook users who liked one of these posts.
  \item for each of these users, collect their list of friends.
  \item for each of the friends, collect their number of friends.
\end{enumerate}

The data was exclusively collected by scraping with Selenium, PhantomJS and
BeautifulSoup. The list of pages to scrape was added to a task queue using
Celery. The code of the dispatcher can be found in \textsf{facebook/run.py}.
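
As an illustration of the scraping pattern used for both Twitter and Facebook,
the following is a minimal sketch (it is not code from the archive; the URL,
CSS selector and timings are hypothetical): the browser is scrolled repeatedly
so that JavaScript loads additional content, and the fully rendered page is
then handed to BeautifulSoup.

\begin{verbatim}
import time

from bs4 import BeautifulSoup
from selenium import webdriver

def scrape_page(url, scrolls=10):
    # PhantomJS renders the page like a regular browser,
    # but without a graphical interface.
    driver = webdriver.PhantomJS()
    driver.get(url)
    # Scroll to the bottom repeatedly so that the page's
    # JavaScript loads additional content.
    for _ in range(scrolls):
        driver.execute_script(
            "window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(1)  # wait for the new content to load
    html = driver.page_source
    driver.quit()
    # Parse the rendered HTML; the CSS class is hypothetical.
    soup = BeautifulSoup(html, "lxml")
    return [a.get_text() for a in soup.select("a.profile-link")]
\end{verbatim}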
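
The dispatch itself follows the standard Celery pattern, sketched below (this
is not the contents of \textsf{facebook/run.py}; the broker URL, file name and
task name are made up). The dispatcher enqueues one task per page, and each
client consumes tasks from the queue and runs the scraping code above.

\begin{verbatim}
from celery import Celery

# The broker URL is hypothetical; any broker
# supported by Celery works.
app = Celery("scraper",
             broker="redis://master.example.com:6379/0")

@app.task
def scrape_page_task(page_url):
    # Runs on a client: scrape the page with
    # Selenium/PhantomJS and store the result.
    pass

if __name__ == "__main__":
    # Dispatcher: enqueue one task per page to scrape.
    with open("pages.txt") as pages:
        for url in pages:
            scrape_page_task.delay(url.strip())
\end{verbatim}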

The clients running on Amazon EC2 were connected to the task queue to obtain
the list of pages to scrape. The code for the clients is in
\textsf{facebook/client/}. A lightweight HTTP server written with Bottle
allowed the clients to dynamically request login credentials from a rotating
list of accounts.

\section{MapReduce Algorithm and Data Analysis}

The implementation of the MapReduce algorithm can be found in the \textsf{mr}
folder. The main part of the algorithm was factored out and implemented in
C/Python with Cython for performance reasons.

The data analysis code can be found in the \textsf{analysis} folder. This part
of the code only contains ``standard'' data processing and plotting code.

\bibliographystyle{abbrvnat}
\bibliography{main}
\end{document}
