author    Thibaut Horel <thibaut.horel@gmail.com>  2014-09-10 11:07:23 -0400
committer Thibaut Horel <thibaut.horel@gmail.com>  2014-09-10 11:07:23 -0400
commit    b61804836ac2e35f636540d5a170486d9f845973 (patch)
tree      3927df58b9b7b444edea505c468167d1356ed0e5
parent    9f52ddd6c280aaac1e649cbd047a236b666ba493 (diff)
download  cs224-b61804836ac2e35f636540d5a170486d9f845973.tar.gz
Scribe notes for lecture 3
-rw-r--r--  lecture/main.bib    5
-rw-r--r--  lecture/main.tex  302
2 files changed, 307 insertions, 0 deletions
diff --git a/lecture/main.bib b/lecture/main.bib
new file mode 100644
index 0000000..354dafb
--- /dev/null
+++ b/lecture/main.bib
@@ -0,0 +1,5 @@
+@MISC{Knuth63,
+  author = {Donald E. Knuth},
+ title = {Notes On ``Open'' Addressing},
+ year = {1963}
+}
diff --git a/lecture/main.tex b/lecture/main.tex
new file mode 100644
index 0000000..3edd05e
--- /dev/null
+++ b/lecture/main.tex
@@ -0,0 +1,302 @@
+\documentclass[11pt]{article}
+\usepackage{amsmath,amssymb,amsthm}
+
+\DeclareMathOperator*{\E}{\mathbb{E}}
+\let\Pr\relax
+\DeclareMathOperator*{\Pr}{\mathbb{P}}
+
+\newcommand{\eps}{\varepsilon}
+\newcommand{\inprod}[1]{\left\langle #1 \right\rangle}
+\newcommand{\R}{\mathbb{R}}
+\renewcommand{\H}{\mathcal{H}}
+
+\newcommand{\handout}[5]{
+ \noindent
+ \begin{center}
+ \framebox{
+ \vbox{
+ \hbox to 5.78in { {\bf CS 224: Advanced Algorithms } \hfill #2 }
+ \vspace{4mm}
+ \hbox to 5.78in { {\Large \hfill #5 \hfill} }
+ \vspace{2mm}
+ \hbox to 5.78in { {\em #3 \hfill #4} }
+ }
+ }
+ \end{center}
+ \vspace*{4mm}
+}
+
+\newcommand{\lecture}[4]{\handout{#1}{#2}{#3}{Scribe: #4}{Lecture #1}}
+
+\newtheorem{theorem}{Theorem}
+\newtheorem{corollary}[theorem]{Corollary}
+\newtheorem{lemma}[theorem]{Lemma}
+\newtheorem{observation}[theorem]{Observation}
+\newtheorem{proposition}[theorem]{Proposition}
+\newtheorem{claim}[theorem]{Claim}
+\newtheorem{fact}[theorem]{Fact}
+\newtheorem{assumption}[theorem]{Assumption}
+\theoremstyle{definition}
+\newtheorem*{example}{Example}
+\newtheorem{definition}[theorem]{Definition}
+
+% 1-inch margins, from fullpage.sty by H.Partl, Version 2, Dec. 15, 1988.
+\topmargin 0pt
+\advance \topmargin by -\headheight
+\advance \topmargin by -\headsep
+\textheight 8.9in
+\oddsidemargin 0pt
+\evensidemargin \oddsidemargin
+\marginparwidth 0.5in
+\textwidth 6.5in
+
+\parindent 0in
+\parskip 1.5ex
+
+\begin{document}
+
+\lecture{3 --- Tuesday, September 9, 2014}{Fall 2014}{Prof.\ Jelani Nelson}{Thibaut Horel}
+\section{Overview}
+
+Today's lecture covers hashing:
+\begin{itemize}
+ \item load balancing
+ \item k-wise independence
+ \item dictionary (perfect hashing, linear probing)
+    \item retrieval problem (cuckoo hashing, Bloomier filters)
+    \item membership (Bloom filters)
+ \item simple tabulation
+\end{itemize}
+
+\section{Hashing}
+
+In hashing, we have a family $\H$ of hash functions mapping $[u]$ to $[m]$
+(where $[u]$ denotes the set $\{0,\ldots, u-1\}$). For a set $S\subseteq [u]$
+with $|S| = n$, we want a function $h$ picked at random from $\H$ to behave
+``nicely'' on $S$.
+
+\begin{example}[load balancing]
+We have $n$ jobs that need to be assigned to $m$ machines and each job has a
+job ID in $[u]$. We want to spread the jobs evenly over the machines. We pick
+$h$ at random from a family of hash functions $\H$. When a new job $j\in[u]$
+comes in, we assign it to machine $h(j)\in[m]$. This is usually referred to as
+the ``balls and bins'' random process.
+\end{example}
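+
+For intuition (an illustration of ours, not from the lecture), the following
+Python sketch simulates the balls and bins process and reports the maximum
+load; the standard-library \texttt{random} module stands in for a truly
+random $h$.
+\begin{verbatim}
+import random
+from collections import Counter
+
+def max_load(n, m, seed=0):
+    # Assign n jobs to m machines with a "random" h and return the max load.
+    rng = random.Random(seed)
+    # rng.randrange(m) plays the role of a truly random h : [u] -> [m].
+    loads = Counter(rng.randrange(m) for _ in range(n))
+    return max(loads.values())
+
+# With m = n, the maximum load is around log n / log log n (see below).
+print(max_load(n=10**5, m=10**5))
+\end{verbatim}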
+
+We would like to say that $\Pr(\exists \text{ machine receiving many more
+than $\frac{n}{m}$ jobs})$ is small. First, let us assume that the random
+variables $h(i)$ for $i\in [u]$ are independent. We will need the following
+concentration inequality.
+
+\begin{theorem}[Chernoff bound]
+    Let $X_1, \ldots, X_n$ be $n$ independent random variables with
+    $X_i\in\{0,1\}$ and $\Pr(X_i=1)=p_i$. Let us write $X=\sum_{i=1}^n X_i$;
+    then for all $\delta>0$:
+ \begin{displaymath}
+ \Pr\big(X\geq (1+\delta)\E[X]\big)\leq
+ \left(\frac{e^\delta}{(1+\delta)^{1+\delta}}\right)^{\E[X]}.
+ \end{displaymath}
+\end{theorem}
+
+The Chernoff bound can be proved using Markov's inequality.
+
+\begin{lemma}[Markov's inequality]
+    Let $X$ be a nonnegative random variable; then:
+ \begin{displaymath}
+ \forall\lambda > 0,\; \Pr (X\geq\lambda)\leq \frac{\E[X]}{\lambda}.
+ \end{displaymath}
+\end{lemma}
+
+\begin{proof}[Proof (Chernoff bound)]
+    For any $t>0$, by Markov's inequality:
+ \begin{displaymath}
+ \Pr (X \geq \lambda) = \Pr\big(e^{tX} \geq
+ e^{t\lambda}\big)\leq\frac{\E\big[e^{tX}\big]}{e^{t\lambda}}.
+ \end{displaymath}
+ By independence:
+ \begin{displaymath}
+ \E\big[e^{tX}\big] = \prod_{i=1}^n\E\big[e^{tX_i}\big]
+ =\prod_{i=1}^n\big(p_ie^t+(1-p_i)\big)
+ \leq\prod_{i=1}^ne^{p_i(e^t-1)} = e^{(e^t-1)\E[X]},
+ \end{displaymath}
+ where the inequality uses $1+x\leq e^x$. We then conclude the proof by
+ choosing $\lambda = (1+\delta)\E[X]$ and $t=\log(1+\delta)$.
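+    Explicitly (filling in the last computation), these choices give:
+    \begin{displaymath}
+        \frac{\E\big[e^{tX}\big]}{e^{t\lambda}}
+        \leq \frac{e^{(e^t-1)\E[X]}}{e^{t(1+\delta)\E[X]}}
+        = \frac{e^{\delta\E[X]}}{(1+\delta)^{(1+\delta)\E[X]}}
+        = \left(\frac{e^\delta}{(1+\delta)^{1+\delta}}\right)^{\E[X]}.
+    \end{displaymath}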
+\end{proof}
+
+Using the Chernoff bound, we can prove the following proposition.
+
+\begin{proposition}
+    Consider the balls and bins random process with $m=n$. Then:
+ \begin{displaymath}
+ \Pr\left(\exists \text{ machine with more than $\frac{c\log n}{\log\log
+ n}$ jobs}\right)
+ \leq \frac{1}{\mathrm{poly}(n)}.
+ \end{displaymath}
+\end{proposition}
+
+\begin{proof}
+ Focus on some machine $t$ and define:
+ \begin{displaymath}
+ X_i= \begin{cases}
+ 1 &\text{if $h(i) = t$}\\
+ 0&\text{otherwise}
+ \end{cases}
+ \end{displaymath}
+ Then, $X=\sum_{i=1}^n X_i$ is the number of jobs assigned to machine $t$
+ and $\E[X]=1$. Applying the Chernoff bound:
+ \begin{displaymath}
+        \Pr\left(X\geq \frac{c\log n}{\log \log n}\right) \leq
+        \left(\frac{e}{c\log n/\log\log n}\right)^{\frac{c\log
+        n}{\log\log n}} = \left(\frac{e}{k}\right)^k\leq\frac{1}{n^c}
+    \end{displaymath}
+    where $k=\frac{c\log n}{\log \log n}$. The last inequality holds for a
+    suitable constant $c$, since the solution of $k^k = n^c$ is
+    $k=\Theta(\log n/\log\log n)$. This implies, by using a union bound:
+ \begin{displaymath}
+ \Pr(\exists\text{ overloaded machine})
+ \leq \sum_{t=1}^n \Pr (t\text{ is overloaded})
+ \leq \frac{1}{n^{c-1}}\qedhere
+ \end{displaymath}
+\end{proof}
+
+We propose an alternative proof which does not rely on the Chernoff bound.
+
+\begin{proof}
+    Let $k=\frac{c\log n}{\log\log n}$ as before and call a machine
+    \emph{overloaded} if it receives at least $k$ jobs. Using union bounds:
+    \begin{align*}
+        \Pr(\exists\text{ overloaded machine})
+        &\leq n \Pr(\text{machine 1 is overloaded})\\
+        &= n\Pr(\exists \text{ $k$ jobs mapping to machine 1})\\
+        &\leq n\sum_{\substack{T\subseteq S\\|T|=k}} \Pr (\text{jobs in $T$ all
+        map to 1})\\
+        &\leq n\sum_{\substack{T\subseteq S\\|T|=k}} \frac{1}{n^k}
+        = \frac{\binom{n}{k}}{n^{k-1}}\leq \frac{n}{k!}
+        \leq n\left(\frac{2}{k}\right)^{k/2}
+\end{align*}
+and we conclude as in the previous proof, since $(2/k)^{k/2} =
+n^{-\Theta(c)}$.
+\end{proof}
+
+Note that the second proof does not require full independence, but only that
+the hash values of any $k$ fixed elements are mutually independent. This
+motivates the following definitions.
+
+\begin{definition}
+    The random variables $X_1,\ldots,X_n$ are \emph{$k$-wise independent} iff
+    for any set of indices $1\leq i_1<\ldots<i_k\leq n$, the random
+    variables $X_{i_1},\ldots,X_{i_k}$ are mutually independent.
+\end{definition}
+
+\begin{definition}
+ A set of hash functions $\H$ is a \emph{$k$-wise independent family} iff
+ the random variables $h(0), h(1),\ldots,h(u-1)$ are $k$-wise independent
+ and uniformly distributed in $[m]$ when $h\in\H$ is drawn uniformly at
+ random.
+\end{definition}
+
+\begin{example}
+ The set $\H$ of all functions from $[u]$ to $[m]$ is $k$-wise independent
+ for all $k$.
+\end{example}
+
+\begin{example}[$u=m=p$, $p$ prime] We define
+    $\H = \big\{h: h(x) = \sum_{i=0}^{k-1} a_ix^i\bmod p\big\}$ where
+    $a_i\in\{0,\ldots,p-1\}$. By polynomial interpolation, we know that for
+    any distinct $x_1,\ldots,x_k\in[p]$ and any $y_1,\ldots,y_k\in[p]$ there
+    exists a unique polynomial of degree less than $k$, i.e.\ a unique
+    $h\in\H$, such that $h(x_1) = y_1,\ldots, h(x_k) = y_k$.
+    Since there are $p^k$ elements in $\H$, if $h\in\H$ is drawn uniformly at
+    random, $\Pr(h(x_1) = y_1,\ldots,h(x_k) = y_k) = 1/p^k$, which shows that
+    $\H$ is $k$-wise independent.
+\end{example}
+
+\begin{example}[$u=p$, $p$ prime\footnote{In practice, we do not lose much by
+    requiring $u$ to be prime. Indeed, by Bertrand's postulate, there always
+    exists a prime between $u$ and $2u$.}]
+    We define $\H = \big\{h: h(x) = \big(\sum_{i=0}^{k-1}a_i x^i \bmod p\big)
+    \bmod m\big\}$. Each value in $[m]$ has at most $\lceil p/m\rceil\leq
+    2p/m$ preimages under the reduction modulo $m$, so for distinct
+    $x_1,\ldots,x_k$ we get $\Pr(h(x_1) = y_1,\ldots, h(x_k) = y_k)\leq
+    \left(\frac{2}{m}\right)^k$, i.e.\ the family is ``almost'' $k$-wise
+    independent.
+\end{example}
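+
+A minimal Python sketch of this construction (ours, for illustration; the
+name \texttt{sample\_hash} is not from the lecture):
+\begin{verbatim}
+import random
+
+def sample_hash(k, p, m):
+    # Draw h from the polynomial family: k random coefficients a_0..a_{k-1}.
+    a = [random.randrange(p) for _ in range(k)]
+    def h(x):
+        # Evaluate a_0 + a_1 x + ... + a_{k-1} x^{k-1} mod p (Horner's rule),
+        # then reduce modulo m as in the second example.
+        acc = 0
+        for coef in reversed(a):
+            acc = (acc * x + coef) % p
+        return acc % m
+    return h
+
+h = sample_hash(k=5, p=2**61 - 1, m=1024)  # 2^61 - 1 is prime
+print(h(42))
+\end{verbatim}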
+
+\section{Dictionary problem}
+
+This is a data structure problem with two operations:
+\begin{itemize}
+ \item \texttt{update(k,v)}: associates key $k$ with value $v$.
+ \item \texttt{query(k)}: returns value $v$ associated with key $k$ or
+ \texttt{null} if $k$ is not in the dictionary.
+\end{itemize}
+The keys and the values are assumed to live in the same universe $[u]$.
+
+\subsection{Chaining with linked lists}
+The dynamic dictionary problem can be solved by hash tables with chaining. In
+this solution the hash table is an array $A$ of size $m$. Each element in the
+array is a pointer to a linked list containing (key, value) pairs.
+
+We pick $h\in\H$ from a $2$-wise independent family. Operations on key $k$
+are performed in the linked list pointed to by $A\big[h(k)\big]$.
+
+The running time of an operation on key $k$ is linear in the length of the
+linked list at $A\big[h(k)\big]$. By $2$-wise independence,
+$\E_h[\text{length of list at $h(k)$}]\leq 1 + \sum_{k'\in S,\,k'\neq k}
+\Pr\big(h(k')=h(k)\big) = 1 + \frac{n-1}{m} = O(1)$ if $m\geq n$.
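+
+As a concrete sketch (ours, for illustration), a chained hash table in
+Python with the two dictionary operations; plain Python lists stand in for
+the linked lists:
+\begin{verbatim}
+class ChainedDict:
+    def __init__(self, m, h):
+        self.h = h                       # hash function [u] -> [m]
+        self.A = [[] for _ in range(m)]  # each cell: a list of (key, value)
+
+    def update(self, k, v):
+        chain = self.A[self.h(k)]
+        for i, (key, _) in enumerate(chain):
+            if key == k:                 # key already present: overwrite
+                chain[i] = (k, v)
+                return
+        chain.append((k, v))
+
+    def query(self, k):
+        for key, value in self.A[self.h(k)]:
+            if key == k:
+                return value
+        return None                      # null: k is not in the dictionary
+\end{verbatim}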
+
+\subsection{Perfect hashing}
+
+\subsection{Linear probing}
+
+In real architectures, sequential memory accesses are much faster than
+random ones. A solution to the dictionary problem with better spatial
+locality is \emph{linear probing}: using an array $A$ of size $m$, store key
+$k$ at position $h(k)$. If that cell is already taken, keep walking to the
+right (wrapping around at the end of the array) until you find an empty cell
+and store the key there.
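+
+A minimal sketch of linear probing in Python (ours, for illustration;
+deletions are omitted and we assume the table never fills up, i.e.\ $m>n$):
+\begin{verbatim}
+class LinearProbingDict:
+    def __init__(self, m, h):
+        self.h = h             # hash function [u] -> [m]
+        self.A = [None] * m    # each cell holds (key, value) or None
+
+    def update(self, k, v):
+        i = self.h(k)
+        # Walk right (wrapping around) until we find k or an empty cell.
+        while self.A[i] is not None and self.A[i][0] != k:
+            i = (i + 1) % len(self.A)
+        self.A[i] = (k, v)
+
+    def query(self, k):
+        i = self.h(k)
+        while self.A[i] is not None:
+            if self.A[i][0] == k:
+                return self.A[i][1]
+            i = (i + 1) % len(self.A)
+        return None            # reached an empty cell: k is absent
+\end{verbatim}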
+
+Linear probing dates back at least to 1954, when it was used in an assembly
+program written by Samuel, Amdahl and Boehme. It was first analyzed in a
+note published by Knuth \cite{Knuth63} in the case where $h$ is totally
+random ($\H$ is the set of all functions): with $m=(1+\eps)n$, the expected
+time per operation is $O(1/\eps^2)$.
+
+A breakthrough came with Pagh, Pagh and Ru\v{z}i\'c ('09), who showed that
+for any $5$-wise independent family $\H$, the expected time per operation is
+$O(1)$ (when $m\geq cn$). P\u{a}tra\c{s}cu and Thorup (ICALP '10) showed
+that $4$-wise independence does not guarantee $O(1)$ expected time.
+
+\begin{claim}
+ If $m\geq cn$ and $\H$ is $7$-wise independent, then $\E(\text{query time})
+ = O(1)$.
+\end{claim}
+
+The query time for a key $x$ is the number of cells probed before finding an
+empty cell, starting from position $h(x)$. We are going to show that the
+expected length of such a run is constant. We first need two definitions.
+
+\begin{definition}
+    Given an interval $I$ of cells in $A$\footnote{We allow intervals to wrap
+    around at the end of the table.}, the \emph{load} of $I$ is
+    $L(I) = |\{x\in S: h(x)\in I\}|$.
+\end{definition}
+
+\begin{definition}
+    $I$ is \emph{full} if $L(I) \geq |I|$.
+\end{definition}
+
+\begin{lemma}
+ If \texttt{query(x)} for some $x$ makes $k$ probes, then $h(x)$ is in some
+ full interval of length at least $k$.
+\end{lemma}
+
+\begin{proof}[Proof (Claim)]
+    By the lemma:
+    \begin{align*}
+        \E[\text{time of \texttt{query(x)}}]
+        &\leq \E[\text{number of full intervals containing } h(x)]\\
+        &\leq \sum_{k=1}^\infty\,\sum_{\substack{I:\,|I|=k\\h(x)\in I}}
+        \Pr(I\text{ full})
+        \leq \sum_{k=1}^\infty k\max_{I:\,|I|=k} \Pr(I\text{ full}).
+    \end{align*}
+    We need to show that $\max_{I:\,|I|=k}\Pr(I\text{ full})$ is $O(1/k^3)$
+    so that the sum converges. Take $m=2n$, where $n$ is the number of
+    distinct active keys; then $\E[L(I)] = |I|/2$ and
+    \begin{displaymath}
+        \Pr(I\text{ full}) \leq \Pr\big(|L(I)-\E[L(I)]|\geq \E[L(I)]\big).
+    \end{displaymath}
+    We would like to use a Chernoff bound, but we do not have full
+    independence. Instead, we raise both sides to the power $6$ and apply
+    Markov's inequality to the $6$th moment of $L(I)-\E[L(I)]$; expanding
+    this moment only involves the hash values of at most $6$ keys at a time
+    (together with $h(x)$, this is where $7$-wise independence is used).
+\end{proof}
+
+\bibliographystyle{alpha}
+\bibliography{main}
+
+\end{document}