\documentclass[12pt]{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage[hmargin=1in, vmargin=1in]{geometry}
\title{Review for \emph{Optimal Pricing in E-Commerce Based on Sparse and Noisy Data}}
\date{}
\begin{document}
\maketitle

\paragraph{Problem and results.}
In this paper, the authors study the problem of determining the optimal price of products in the context of online stores. The goal is to use past data (consisting of prices and the associated mean daily revenue/profit) to propose new, approximately optimal prices. The main challenges are that the data is sparse (some products have few data points) and noisy (mean daily revenue varies a lot due to exogenous factors), and that the function to optimize (revenue as a function of the chosen price) is both unknown and complex, with possibly multiple local optima. The proposed method proceeds as follows:
\begin{itemize}
\item use bootstrap estimation to build an estimator of the probability that an observed price in the dataset is optimal.
\item use kernel regression to extrapolate this function to all prices: this step gives a way to compute, for any (possibly unobserved) price, the probability that it is optimal.
\item use additional data (e.g., products from the same category or competitor prices) to build a prior distribution on the probability that a price is optimal. The model used for this step is a decision tree.
\item use the Metropolis-Hastings algorithm to sample prices from the posterior distribution (a rough sketch of how I understand these steps fit together is given below).
\end{itemize}
After describing their method, the authors present experimental results, both simulation-based and based on a deployment on a large European E-commerce website, and show that the new method performs better than baselines and led to significant increases in the revenue of the company it was deployed at.
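
For concreteness, below is a rough sketch of how I understand these steps fit together. The function names, the Gaussian kernel with bandwidth \texttt{h}, the way the decision-tree prior enters the posterior, and the random-walk proposal are all illustrative assumptions on my part, not details taken from the paper.
\begin{verbatim}
import numpy as np

def prob_above_threshold(revenues, threshold, B=1000, rng=None):
    # Bootstrap estimate of P(mean revenue at this price exceeds the threshold).
    rng = np.random.default_rng() if rng is None else rng
    revenues = np.asarray(revenues)
    means = np.array([rng.choice(revenues, size=len(revenues), replace=True).mean()
                      for _ in range(B)])
    return (means > threshold).mean()

def kernel_regression(price, obs_prices, obs_probs, h=1.0):
    # Nadaraya-Watson extrapolation of the bootstrap probabilities to any price.
    w = np.exp(-0.5 * ((price - obs_prices) / h) ** 2)
    return np.dot(w, obs_probs) / w.sum()

def log_posterior(price, obs_prices, obs_probs, prior_prob, h=1.0):
    # prior_prob: callable giving the prior built from auxiliary data
    # (e.g. a fitted decision tree); combined here with the extrapolated estimate.
    like = kernel_regression(price, obs_prices, obs_probs, h)
    return np.log(prior_prob(price) + 1e-12) + np.log(like + 1e-12)

def metropolis_hastings(log_post, p0, n_steps=5000, step=1.0, rng=None):
    # Random-walk Metropolis-Hastings over candidate prices.
    rng = np.random.default_rng() if rng is None else rng
    samples, p, lp = [], p0, log_post(p0)
    for _ in range(n_steps):
        q = p + step * rng.normal()
        lq = log_post(q)
        if np.log(rng.uniform()) < lq - lp:
            p, lp = q, lq
        samples.append(p)
    return np.array(samples)
\end{verbatim}
Even in this simplified form, the interaction between the data-dependent threshold, the kernel bandwidth and the proposal step size is hard to reason about, which is the source of several of the concerns below.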

\paragraph{Comments on the proposed method.}
The proposed method uses fairly standard ideas from statistics and machine learning, which apparently have not been used so far in the context of optimal pricing in E-commerce. I am not knowledgeable about the E-commerce literature, so I am basing this observation on the related work section. Each step is properly motivated by the authors and tailored to the specific challenges and features of the problem at hand. However, my main concerns regarding the method are the following:
\begin{itemize}
\item Instead of directly optimizing the (unknown) revenue/profit function, the authors choose to focus on maximizing the probability that a given price is optimal. However, directly estimating the probability that a price is optimal is infeasible, so instead the method optimizes the probability that a given price leads to a revenue (or a combination of revenue and profit) above some threshold, as defined in Equation (5). It is unclear what this achieves: in particular, the algorithm could get stuck selecting suboptimal prices in which it is very confident, and fail to explore prices which could be better but for which the confidence is currently low (see also my comment below about the exploration/exploitation trade-off).
\item Related to the previous point, very little guidance is given about the choice of the threshold introduced in Equation (5). This is problematic since, if it is chosen too low, the algorithm could get stuck at suboptimal prices. The authors indicate that the threshold can be chosen based on previously observed data points, which I think exacerbates the problem of getting stuck before performing sufficient exploration.
\item The paper mentions ``dynamic pricing'' several times, and this is indeed a key feature of the problem at hand. However, the proposed method seems highly static to me: the main algorithm considers all data points available so far and computes prices for the next day (or next time period), and needs to be rerun whenever new prices are needed. It would be nice to show how this procedure can be turned into a feedback loop where new data points are efficiently incorporated. For example, can the probabilities be efficiently updated when new data points are observed? Is the approach fully Bayesian, in the sense that the computed probabilities are used as the prior distribution for the next step, or does the method always start from the prior distribution given by the decision trees?
\item Even if each step is fairly standard and theoretically well understood, the entire procedure combines them in a way which is hard to reason about. As mentioned before, the threshold in (5) depends on the observed data and seems to lead to insufficient exploration. This could be counterbalanced by the proposal distribution in the Metropolis-Hastings algorithm, but it is unclear what the overall procedure converges to (see my comment below about the absence of theoretical results).
\end{itemize}

\paragraph{Absence of theoretical results.}
Related to the last point above, it is regrettable that the proposed method lacks theoretical guarantees. Even though each individual step is known to converge (or to be optimal) in the limit of infinitely many data points, the steps are combined in a way which makes it hard to be convinced that the entire procedure converges to a good solution in the limit.

\paragraph{Additional remark on the overall approach.}
At a high level, the problem is an instance of the following well-studied problem: optimizing an unknown function based on noisy observations of its values. Its main challenge is the so-called ``exploration-exploitation'' trade-off: methods for this problem need to balance exploration (to learn enough about the structure of the objective function) and exploitation (to make good decisions with respect to what is currently known). Relevant keywords which roughly cover this area of research are: bandit algorithms, Bayesian bandits, online optimization, regret minimization, Thompson sampling, derivative-free optimization under noise, and optimization with bandit feedback. Many solutions proposed in this line of work share similarities with the method proposed in this paper, but address the problem in a more principled way. I think the authors should at the very least discuss extensively why a new ad hoc solution is needed and how it compares to this line of work. I would also recommend trying to frame the problem using one of these frameworks: for example, regret minimization seems very well suited to describe dynamic pricing, where decisions are made over time based on partial observations.
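
To make this comparison concrete, a Thompson-sampling-style baseline for this setting could be as simple as the following sketch. The discretized price grid, the Gaussian model of mean daily revenue with known noise variance, and the callable \texttt{observe\_revenue} are illustrative assumptions on my part, not elements of the paper.
\begin{verbatim}
import numpy as np

def thompson_pricing(price_grid, n_days, observe_revenue, noise_var=1.0, rng=None):
    # Gaussian Thompson sampling: each price on the grid keeps a Normal posterior
    # over its mean daily revenue; each day we sample one mean per price, post the
    # price whose sample is highest, then update that price's posterior.
    rng = np.random.default_rng() if rng is None else rng
    k = len(price_grid)
    mu = np.zeros(k)        # posterior means of the mean daily revenue
    var = np.full(k, 1e6)   # vague prior variances
    for _ in range(n_days):
        sampled = rng.normal(mu, np.sqrt(var))
        i = int(np.argmax(sampled))
        r = observe_revenue(price_grid[i])   # noisy revenue observed for that price
        # conjugate Normal update with known observation noise
        precision = 1.0 / var[i] + 1.0 / noise_var
        mu[i] = (mu[i] / var[i] + r / noise_var) / precision
        var[i] = 1.0 / precision
    return price_grid[int(np.argmax(mu))]
\end{verbatim}
A procedure of this kind balances exploration and exploitation automatically (prices with uncertain posteriors keep being tried) and comes with well-understood regret guarantees in standard settings, which is exactly the kind of principled behaviour that is difficult to argue for with the proposed method.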

\paragraph{Comments on the experimental results.}
The experiments seem sound and well designed, and the proposed method does appear to perform significantly better than the baselines. More details could be given about how, and to which values, the various meta-parameters were chosen: in particular, the bootstrap parameter $B$, the kernel regression parameter $h$, the threshold in Equation (5), the $\lambda$ on page 13, and how the decision trees were trained.

\paragraph{Comments on the writing.}
The paper is well written overall, with a structure which is easy to follow. A few points which could improve the exposition:
\begin{itemize}
\item The caption of Figure 1 (page 4) should contain one sentence explaining that the curve is obtained via extrapolation with kernel regression (and refer to the later section where this is explained).
\item Section 3.4.1 on the bootstrap is heavy and not very useful in my opinion. It should be assumed that readers are somewhat familiar with bootstrap estimation. I would reduce this section to one paragraph, with perhaps one or two sentences on the general principle of bootstrap estimation followed by a clear explanation of the form ``we apply bootstrap estimation to data X to estimate Y, which will then be used to do Z''.
\item The description of Algorithm 1 could be significantly simplified by removing all uses of the $\hat{F}_{boot}$ distribution: here the algorithm only uses an empirical distribution to estimate a probability. All that is really done is counting the fraction of bootstrap samples whose mean target value is above the threshold $\tilde{y}$. This can be explained in one sentence in English, but is considerably obfuscated by the way Algorithm 1 is written. Furthermore, I think line 9 of Algorithm 1 contains a typo: it should probably read ``Solve $\hat{q}_n = \hat{F}_{boot}(\tilde{y})$''. But again, I would recommend not using $\hat{F}_{boot}$ at all.
\item Section 3.5.2 is similarly heavy: it should be assumed that the reader is familiar with the concept of a prior distribution. For example, I would remove the two paragraphs below Equation (8).
\end{itemize}

\paragraph{Conclusion.}
While the problem is interesting and the proposed method works well experimentally, I cannot recommend the paper for publication in its current form: the method is not particularly novel and combines fairly standard techniques in an ad hoc way which makes it hard to reason about convergence, whereas a well-established area of research already offers many tools which could lead to a more principled solution to the problem.
\end{document}