\documentclass{article}
\title{Information Theoretic Analysis of the Complexity of Comparison-based Sorting}
\author{Thibaut Horel}
\usepackage[utf8]{inputenc}
\usepackage{amsmath, amsfonts, amsthm}
\newcommand{\set}[1]{\mathbf{#1}}
\newcommand{\gx}{\mathcal{X}}
\newcommand{\bx}{\boldsymbol{x}}
\newcommand{\fa}{\phi}
\newcommand*{\defeq}{\equiv}
\newtheorem*{lemma}{Lemma}
\newtheorem*{theorem}{Theorem}
\newtheorem*{proposition}{Proposition}
\theoremstyle{remark}
\newtheorem*{remark}{Remark}
\let\P\relax
\DeclareMathOperator{\P}{\mathbb{P}}
\DeclareMathOperator{\E}{\mathbb{E}}
\usepackage[pagebackref=true,breaklinks=true,colorlinks=true]{hyperref}
\begin{document}
\maketitle

We give an information-theoretic analysis of the complexity of comparison-based sorting algorithms. We first prove the standard $n\log n$ lower bound on the worst-case complexity of sorting. While conceptually equivalent to the ``standard proof'', the argument presented here shows how the information-theoretic framework provides a simple and unified approach which naturally generalizes to average-case complexity and to randomized algorithms.

\paragraph{Model.} Let $\mathfrak{S}(E)$ be the symmetric group of permutations of a set $E$ of $n$ elements. We assume that $E$ is totally ordered and write $e_1,\dots, e_n$ for its elements in sorted order. A comparison-based algorithm is given as input a permuted array $[\sigma(e_1), \dots, \sigma(e_n)]$ for some $\sigma\in\mathfrak{S}(E)$ and is allowed to compare any two cells of this array. Formally, the algorithm has access to $\sigma$ through \emph{comparison queries}; for $(i, j)\in\{1,\dots,n\}^2$, the output of query $q(i, j)$ is given by:
\begin{displaymath}
q(i, j) =
\begin{cases}
1&\text{if } \sigma(e_i)\leq \sigma(e_j)\\
0&\text{otherwise}
\end{cases}
\end{displaymath}
The output of the algorithm is the sorted array $[e_1,\dots,e_n]$. Note that the input of the algorithm by itself is not a complete representation of $\sigma$, but the pair (input, output) is. In other words, the algorithm discovers the permutation $\sigma$ through comparison queries. This point of view is important for understanding the information-theoretic approach.
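For concreteness, here is a small illustration of the query model (the specific input is of course arbitrary). Take $n=3$ and let $\sigma$ be the permutation for which the input array is $[\sigma(e_1),\sigma(e_2),\sigma(e_3)]=[e_2, e_3, e_1]$. Then
\begin{displaymath}
q(1,2)=1,\qquad q(1,3)=0,\qquad q(2,3)=0,
\end{displaymath}
and these three answers already determine $\sigma$: cell $3$ holds the smallest element, cell $1$ the middle one and cell $2$ the largest. This is consistent with the lower bound proved below, since $\lceil\log_2 3!\rceil=3$ comparisons are needed in the worst case to distinguish the $3!=6$ possible permutations.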
\paragraph{Worst-case complexity.} For a comparison-based sorting algorithm $A$ and $\sigma\in\mathfrak{S}(E)$ we denote by $c(A, \sigma)$ the number of comparison queries performed by $A$ on input $\sigma$\footnote{Note that the number of comparisons can depend on the input: for example, bubble sort (with early termination) performs only $n-1$ comparisons on an already sorted input, but requires $\Omega(n^2)$ comparisons in the worst case.}. The worst-case complexity $c_A$ of $A$ is defined by $c_A\defeq\max_{\sigma\in\mathfrak{S}(E)} c(A,\sigma)$.

\begin{proposition}
For any deterministic comparison-based sorting algorithm $A$, we have $c_A = \Omega(n\log n)$.
\end{proposition}

\begin{proof}
Let us consider a random variable $X$ uniform over $\mathfrak{S}(E)$ and let us denote by $Y=(Y_1,\dots,Y_{c(A,X)})$ the sequence of answers to the queries performed by $A$ on input $X$. Since $A$ is deterministic, its entire execution is determined by the answers it receives; in particular, two distinct permutations leading to the same sequence of answers would cause $A$ to apply the same rearrangement to its input, and the two inputs cannot both end up sorted. Hence $Y$ determines $X$ and the conditional entropy $H(X|Y)$ is $0$. We can thus write:
\begin{align}
0 = H(X|Y) &= H(X, Y) - H(Y) \\
&=H(X) - H(Y)\label{eq:2}\\
&\geq H(X) - c_A\label{eq:3}\\
&=\log(n!) - c_A \label{eq:4}
\end{align}
where \eqref{eq:2} follows from the fact that $Y$ is a deterministic function of $X$ and \eqref{eq:4} follows from the fact that $X$ is uniform over a set of size $n!$.
\eqref{eq:3} follows from the bound $H(Y)\leq c_A$: each query reveals at most one bit of information and there are at most $c_A$ queries. This argument can be made formal. Since $A$ is deterministic, its decision to stop (or to ask a further query) depends only on the answers received so far, so no possible answer sequence is a proper prefix of another one: the set $\mathcal{Y}$ of possible values of $Y$ is prefix-free and hence satisfies Kraft's inequality $\sum_{y\in\mathcal{Y}} 2^{-|y|}\leq 1$, where $|y|$ denotes the length of the sequence $y$. Writing $p(y)\defeq\P[Y=y]$, we obtain:
\begin{align}
H(Y) &= \sum_{y\in\mathcal{Y}} p(y)\log\frac{1}{p(y)} \leq \sum_{y\in\mathcal{Y}} p(y)\log\frac{1}{2^{-|y|}}\label{eq:7}\\
&= \sum_{y\in\mathcal{Y}} p(y)\,|y| = \E[c(A, X)]\label{eq:8}\\
&\leq c_A\label{eq:9}
\end{align}
where \eqref{eq:7} is Gibbs' inequality (applicable since $\sum_{y\in\mathcal{Y}}2^{-|y|}\leq 1$), \eqref{eq:8} uses the fact that the length of $Y$ is exactly $c(A, X)$, and \eqref{eq:9} simply follows from the definition of worst-case complexity. We then conclude the proof by rewriting \eqref{eq:4} as $c_A\geq \log(n!)$ and applying a standard Stirling-type estimate: $n!\geq (n/e)^n$, hence $\log(n!)\geq n\log n - n\log e = \Omega(n\log n)$.
\end{proof}

\begin{remark}
The above proof is written in a verbose way to avoid any confusion and to facilitate further generalizations, but the intuition is very simple: we gain $\log(n!)$ bits of information in $c_A$ queries and each query gives at most $1$ bit of information, hence $c_A\geq \log(n!)$.
\end{remark}

\paragraph{Average-case complexity.} The average-case complexity of a comparison-based sorting algorithm $A$ is defined by $C_A\defeq \E[c(A, X)]$ where $X$ is a uniform random variable over $\mathfrak{S}(E)$. We have the trivial relationship $c_A\geq C_A$ between worst-case and average-case complexity, since the expectation of a random variable is at most its supremum. Perhaps surprisingly though, sorting is asymptotically no easier on average than in the worst case.

\begin{proposition}
For any deterministic comparison-based sorting algorithm $A$, we have $C_A = \Omega(n\log n)$.
\end{proposition}

\begin{proof}
The proof is exactly the same as the previous one, except that we observe that \eqref{eq:8} already provides the stronger bound $H(Y)\leq \E[c(A,X)] = C_A$. In other words, the previous proof was implicitly a proof of the average-case lower bound, combined with the fact that $c_A\geq C_A$.
\end{proof}

\paragraph{Randomized algorithms.} We now consider the case where $A$ can be a randomized algorithm, \emph{i.e.}, a distribution over deterministic algorithms. In particular, the answers $Y_1, \dots, Y_{c(A, X)}$ obtained by $A$ on input $X$ are no longer a deterministic function of $X$. The complexity of $A$ is then similarly defined by $C_A \defeq \E[c(A, X)]$ where the expectation is taken over both the randomness of $A$ and the randomness of $X$ (uniform over $\mathfrak{S}(E)$).

\begin{remark}
One could also define the \emph{worst-case expected complexity} $C_A'\defeq\max_{\sigma}\E[c(A, \sigma)]$, but as before we have $C_A'\geq C_A$. As a consequence, a lower bound on $C_A$ (randomized algorithm and random input) is, in some sense, the strongest result we can achieve.
\end{remark}

\begin{proposition}
For a (possibly randomized) comparison-based sorting algorithm $A$, we have $C_A=\Omega(n\log n)$.
\end{proposition}

\begin{proof}
It suffices to condition on the random bits of $A$. For every fixed value $r$ of these bits, the algorithm $A_r$ obtained by hard-coding $r$ is a deterministic comparison-based sorting algorithm, so the previous proof gives $\E_X[c(A_r, X)]\geq\log(n!)$. Taking the expectation over $r$ yields $C_A\geq\log(n!)=\Omega(n\log n)$. Note that this conditioning is needed: for a randomized algorithm, the answers $Y$ alone need not determine $X$, since the queries that were asked depend on the random bits.
\end{proof}

\paragraph{TODO.} Talk about the case where $E$ is a multiset, i.e., some keys can be repeated, entropy-optimal sorting, 3-way quicksort, etc.
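\begin{remark}
As a small numerical sanity check of the lower bound: for $n=4$, we have $\log_2(4!)=\log_2 24\approx 4.59$, so any deterministic comparison-based algorithm must perform at least $5$ comparisons on some input. This matches the classical fact that $4$ elements can always be sorted with $5$ comparisons, but not with $4$.
\end{remark}

\end{document}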