\documentclass[12pt]{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage[hmargin=1in, vmargin=1in]{geometry}
\title{Review for \emph{Optimal Pricing in E-Commerce Based on Sparse and Noisy Data}}
\date{}
\begin{document}
\maketitle

\paragraph{Problem and results.}
In this paper, the authors study the problem of determining the optimal price of products in the context of online stores. The goal is to use past data (consisting of prices and the associated mean daily revenue/profit) to propose new, approximately optimal prices. The main challenges are that the data is sparse (some products have few data points) and noisy (mean daily revenue varies a lot due to exogenous factors), and that the function to optimize (revenue as a function of the chosen price) is both unknown and complex, with possibly multiple local optima. The proposed method proceeds as follows:
\begin{itemize}
\item use bootstrap estimation to build an estimator of the probability that an observed price in the dataset is optimal.
\item use kernel regression to extrapolate this function to all prices: this step gives a way to compute, for any (possibly unobserved) price, the probability that it is optimal.
\item use additional data (e.g., products from the same category or competitor prices) to build a prior distribution on the probability that a price is optimal. The model used for this step is a decision tree.
\item use the Metropolis-Hastings algorithm to sample prices from the posterior distribution (a rough sketch of how I understand these steps fit together is given below).
\end{itemize}
After describing their method, the authors present experimental results, both simulation-based and based on a deployment on a large European E-commerce website, and show that the new method performs better than baselines and led to significant increases in the revenue of the company it was deployed at.
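
For concreteness, below is a rough sketch of how I understand these steps fit together. The function names, the Gaussian kernel with bandwidth \texttt{h}, the way the decision-tree prior enters the posterior, and the random-walk proposal are all illustrative assumptions on my part, not details taken from the paper.
\begin{verbatim}
import numpy as np

def prob_above_threshold(revenues, threshold, B=1000, rng=None):
    # Bootstrap estimate of P(mean revenue at this price exceeds the threshold).
    rng = np.random.default_rng() if rng is None else rng
    revenues = np.asarray(revenues)
    means = np.array([rng.choice(revenues, size=len(revenues), replace=True).mean()
                      for _ in range(B)])
    return (means > threshold).mean()

def kernel_regression(price, obs_prices, obs_probs, h=1.0):
    # Nadaraya-Watson extrapolation of the bootstrap probabilities to any price.
    w = np.exp(-0.5 * ((price - obs_prices) / h) ** 2)
    return np.dot(w, obs_probs) / w.sum()

def log_posterior(price, obs_prices, obs_probs, prior_prob, h=1.0):
    # prior_prob: callable giving the prior built from auxiliary data
    # (e.g. a fitted decision tree); combined here with the extrapolated estimate.
    like = kernel_regression(price, obs_prices, obs_probs, h)
    return np.log(prior_prob(price) + 1e-12) + np.log(like + 1e-12)

def metropolis_hastings(log_post, p0, n_steps=5000, step=1.0, rng=None):
    # Random-walk Metropolis-Hastings over candidate prices.
    rng = np.random.default_rng() if rng is None else rng
    samples, p, lp = [], p0, log_post(p0)
    for _ in range(n_steps):
        q = p + step * rng.normal()
        lq = log_post(q)
        if np.log(rng.uniform()) < lq - lp:
            p, lp = q, lq
        samples.append(p)
    return np.array(samples)
\end{verbatim}
Even in this simplified form, the interaction between the data-dependent threshold, the kernel bandwidth and the proposal step size is hard to reason about, which is the source of several of the concerns below.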

\paragraph{Comments on the proposed method.}
The proposed method uses fairly standard ideas from statistics and machine learning, which apparently have not been used so far in the context of optimal pricing in E-commerce. I am not knowledgeable about the E-commerce literature, so I am basing this observation on the related work section. Each step is properly motivated by the authors and tailored to the specific challenges and features of the problem at hand. However, my main concerns regarding the method are the following:
\begin{itemize}
\item Instead of directly optimizing the (unknown) revenue/profit function, the authors choose to focus on maximizing the probability that a given price is optimal. However, directly estimating the probability that a price is optimal is infeasible, so instead the method optimizes the probability that a given price leads to a revenue (or a combination of revenue and profit) above some threshold, as defined in Equation (5). It is unclear what this achieves: in particular, the algorithm could get stuck selecting suboptimal prices in which it is very confident, and fail to explore prices which could be better but for which the confidence is currently low (see also my comment below about the exploration/exploitation trade-off).
\item Related to the previous point, very little guidance is given about the choice of the threshold introduced in Equation (5). This is problematic since, if it is chosen too low, the algorithm could get stuck at suboptimal prices. The authors indicate that the threshold can be chosen based on previously observed data points, which I think exacerbates the problem of getting stuck before performing sufficient exploration.
\item The paper mentions ``dynamic pricing'' several times, and this is indeed a key feature of the problem at hand. However, the proposed method seems highly static to me: the main algorithm considers all data points available so far and computes prices for the next day (or next time period), and needs to be rerun whenever new prices are needed. It would be nice to show how this procedure can be turned into a feedback loop where new data points are efficiently incorporated. For example, can the probabilities be efficiently updated when new data points are observed? Is the approach fully Bayesian, in the sense that the computed probabilities are used as the prior distribution for the next step, or does the method always start from the prior distribution given by the decision trees?
\item Even if each step is fairly standard and theoretically well understood, the entire procedure combines them in a way which is hard to reason about. As mentioned before, the threshold in (5) depends on the observed data and seems to lead to insufficient exploration. This could be counterbalanced by the proposal distribution in the Metropolis-Hastings algorithm, but it is unclear what the overall procedure converges to (see my comment below about the absence of theoretical results).
\end{itemize}

\paragraph{Absence of theoretical results.}
Related to the last point above, it is regrettable that the proposed method lacks theoretical guarantees. Even though each individual step is known to converge (or to be optimal) in the limit of infinitely many data points, the steps are combined in a way which makes it hard to be convinced that the entire procedure converges to a good solution in the limit.

\paragraph{Additional remark on the overall approach.}
At a high level, the problem is an instance of the following well-studied problem: optimizing an unknown function based on noisy observations of its values. Its main challenge is the so-called ``exploration-exploitation'' trade-off: methods for this problem need to balance exploration (to learn enough about the structure of the objective function) and exploitation (to make good decisions with respect to what is currently known). Relevant keywords which roughly cover this area of research are: bandit algorithms, Bayesian bandits, online optimization, regret minimization, Thompson sampling, derivative-free optimization under noise, and optimization with bandit feedback. Many solutions proposed in this line of work share similarities with the method proposed in this paper, but address the problem in a more principled way. I think the authors should at the very least discuss extensively why a new ad hoc solution is needed and how it compares to this line of work. I would also recommend trying to frame the problem using one of these frameworks: for example, regret minimization seems very well suited to describe dynamic pricing, where decisions are made over time based on partial observations.
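
To make this comparison concrete, a Thompson-sampling-style baseline for this setting could be as simple as the following sketch. The discretized price grid, the Gaussian model of mean daily revenue with known noise variance, and the callable \texttt{observe\_revenue} are illustrative assumptions on my part, not elements of the paper.
\begin{verbatim}
import numpy as np

def thompson_pricing(price_grid, n_days, observe_revenue, noise_var=1.0, rng=None):
    # Gaussian Thompson sampling: each price on the grid keeps a Normal posterior
    # over its mean daily revenue; each day we sample one mean per price, post the
    # price whose sample is highest, then update that price's posterior.
    rng = np.random.default_rng() if rng is None else rng
    k = len(price_grid)
    mu = np.zeros(k)        # posterior means of the mean daily revenue
    var = np.full(k, 1e6)   # vague prior variances
    for _ in range(n_days):
        sampled = rng.normal(mu, np.sqrt(var))
        i = int(np.argmax(sampled))
        r = observe_revenue(price_grid[i])   # noisy revenue observed for that price
        # conjugate Normal update with known observation noise
        precision = 1.0 / var[i] + 1.0 / noise_var
        mu[i] = (mu[i] / var[i] + r / noise_var) / precision
        var[i] = 1.0 / precision
    return price_grid[int(np.argmax(mu))]
\end{verbatim}
A procedure of this kind balances exploration and exploitation automatically (prices with uncertain posteriors keep being tried) and comes with well-understood regret guarantees in standard settings, which is exactly the kind of principled behaviour that is difficult to argue for with the proposed method.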

\paragraph{Comments on the experimental results.}
The experiments seem sound and well designed, and the proposed method does appear to perform significantly better than the baselines. More details could be given about how, and to which values, the various meta-parameters were chosen: in particular, the bootstrap parameter $B$, the kernel regression parameter $h$, the threshold in Equation (5), the $\lambda$ on page 13, and how the decision trees were trained.

\paragraph{Comments on the writing.}
The paper is well written overall, with a structure which is easy to follow. A few points which could improve the exposition:
\begin{itemize}
\item The caption of Figure 1 (page 4) should contain one sentence explaining that the curve is obtained via extrapolation with kernel regression (and refer to the later section where this is explained).
\item Section 3.4.1 on the bootstrap is heavy and not very useful in my opinion. It should be assumed that readers are somewhat familiar with bootstrap estimation. I would reduce this section to one paragraph, with perhaps one or two sentences on the general principle of bootstrap estimation followed by a clear explanation of the form ``we apply bootstrap estimation to data X to estimate Y, which will then be used to do Z''.
\item The description of Algorithm 1 could be significantly simplified by removing all uses of the $\hat{F}_{boot}$ distribution: here the algorithm only uses an empirical distribution to estimate a probability. All that is really done is counting the fraction of bootstrap samples whose mean target value is above the threshold $\tilde{y}$. This can be explained in one sentence in English, but is considerably obfuscated by the way Algorithm 1 is written. Furthermore, I think line 9 of Algorithm 1 contains a typo: it should probably read ``Solve $\hat{q}_n = \hat{F}_{boot}(\tilde{y})$''. But again, I would recommend not using $\hat{F}_{boot}$ at all.
\item Section 3.5.2 is similarly heavy: it should be assumed that the reader is familiar with the concept of a prior distribution. For example, I would remove the two paragraphs below Equation (8).
\end{itemize}

\paragraph{Conclusion.}
While the problem is interesting and the proposed method works well experimentally, I cannot recommend the paper for publication in its current form: the method is not particularly novel and combines fairly standard techniques in an ad hoc way which makes it hard to reason about convergence, whereas a well-established area of research already offers many tools which could lead to a more principled solution to the problem.
\end{document}