1 Introduction

Non-parametric statistics [53] has broad applications in density estimation [30], regression [25], and many engineering areas such as image classification [38] and natural language processing [50]. Because they are distribution-free, non-parametric estimators combine predictive power with mild requirements on oracles for the probability distribution. Recently, the study of stochastic programming (SP) models with predictor/response data has begun to attract attention due to modern applications with evolving/streaming data [3, 10, 14, 26, 29, 31, 32, 33, 41, 42]. One streaming-data example is real-time renewable power generation (wind and solar), where operations can be guided by real-time demand data as well as real-time power production from renewable resources. Following Deng and Sen [10], we refer to such SP models as Predictive SP (PSP), which hold the promise of providing greater agility and adaptivity for SP models. While the predictive feature adds agility to decision-making, there is one particularly insidious aspect of such models: simply averaging finitely many functions with equal weights (as in Sample Average Approximation (SAA)) will, in general, produce a constant bias that may not shrink even as the sample size increases indefinitely (see Bertsimas and Kallus [3] for more details).

To set the stage for our development, we consider the PSP problem where a vector of random model parameters, \({\tilde{\xi }}\), depends on a predictor (random variable), \({{\tilde{\omega }}}\), and such predictors can be observed before the realization of \({{\tilde{\xi }}}\). For instance, \({\tilde{\omega }}\) is a random vector of temperature and humidity, while \({\tilde{\xi }}\) denotes precipitation [43].

In order to put the discussion on a firm mathematical footing, we first introduce the mathematical setting. Let \({\mathcal {X}}\) be a Borel subset of \({\mathbb {R}}^{m_1}\) and let \({\mathcal {Y}}\) be a Borel subset of \({\mathbb {R}}^{m_2}\). Let \((\Omega , \Sigma _{\Omega }, {\mathbb {P}})\) be the probability space of the correlated continuous random variables \({\tilde{\omega }}\) and \({\tilde{\xi }}\) taking values in the measurable spaces \(({\mathcal {X}}, \Sigma _{{\mathcal {X}}})\) and \(({\mathcal {Y}}, \Sigma _{{\mathcal {Y}}})\), respectively. In particular, the tuple \(({\tilde{\omega }},{\tilde{\xi }}): \Omega \mapsto {\mathcal {X}} \times {\mathcal {Y}}\) is a random variable taking values in the product space \(({\mathcal {X}} \times {\mathcal {Y}}, \Sigma _{{\mathcal {X}}} \otimes \Sigma _{\mathcal {Y}})\). While the underlying distribution of predictor-response pairs \(({\tilde{\omega }},{\tilde{\xi }})\) will not be assumed to be known, we will assume that a dataset of outcomes \(S^{{\mathcal {N}}} \triangleq \{(\omega _1,\xi _1), (\omega _2,\xi _2), \ldots , (\omega _{{\mathcal {N}}}, \xi _{\mathcal {N}})\}\) is available, and that more observations become available as \({\mathcal {N}} \rightarrow \infty \). Moreover, the regular conditional distribution of \({\tilde{\xi }}\) given \({\tilde{\omega }} = \omega \), denoted \(K(\omega ,\cdot )\), is assumed to exist (Footnote 1) and,

$$\begin{aligned} f(x, \omega ) \triangleq {\mathbb {E}}_{{\tilde{\xi }} \sim K(\omega ,\cdot )}[F(x,{\tilde{\xi }}) \vert {\tilde{\omega }} = \omega ] \triangleq \int _{{\mathcal {Y}}} F(x,\xi ) K(\omega ,d \xi ) \end{aligned}$$
(1)

is well defined and finite valued for any \(x \in X\). In ordinary non-parametric regression, we fix x and \(\omega \) and then use \(\{(F(x,\xi _1),\omega _1), \ldots , (F(x,\xi _{{\mathcal {N}}}),\omega _{{\mathcal {N}}} )\}\) from \(S^{{\mathcal {N}}}\) to construct a non-parametric regression of \({\mathbb {E}}_{{\tilde{\xi }} \sim K(\omega ,\cdot )}[F(x,{\tilde{\xi }}) \vert {\tilde{\omega }} = \omega ]\) in (1). Instead of simply estimating \(f(x,\omega )\) in (1), our goal is to provide an efficient methodology that supports simultaneous estimation and optimization in online settings where the data and decision processes evolve, so as to attain an asymptotically optimal decision for the following PSP problem.

$$\begin{aligned} x^*(\omega ) \in \arg \min _{x \in X} \ f(x, \omega ). \end{aligned}$$
(2)

Note that \(\omega \) is a fixed parameter in both (1) and (2). In the context of applications in reservoir management, x may denote the operating policy of the reservoir under the current observation of predictors such as humidity and temperature, and \(f(x,\omega )\) is the operating cost to be optimized.

In contrast to this paper, the works of Deng and Sen [10] and Kannan et al. [32] first discover a statistical model (i.e., a regression of \({\tilde{\xi }} \vert {\tilde{\omega }})\) and then incorporate that model within optimization. This is a two-step process referred to as Learning Enabled Optimization (LEO).

A somewhat different variant of the LEO setup is one where the regression is decision-dependent, and calls for a local regression approximating the relationship between \({{\tilde{\xi }}}\) and x (i.e., \({\tilde{\xi }} \vert x\)). That approach, which appears in [37], is referred to as Coupled Learning Enabled Optimization (CLEO), and is appropriate when one has the ability to generate data sets by querying a simulation.

A model which may be considered as the closest relative to (2) is studied by Bertsimas and Kallus [3]. It may be interpreted as an estimate-then-optimize strategy which uses non-parametric regression of the objective function. As observed in [46], this is also a two-step process where the non-parametric estimation of the objective function precedes the optimization process. Their study involves the asymptotic properties of the solution to the following problem:

$$\begin{aligned} {\hat{x}}^{{\mathcal {N}}}(\omega ) \in \arg \min _{x \in X} \ {\hat{f}}_{{\mathcal {N}}}(x,\omega ; S^{{\mathcal {N}}}) \triangleq \sum _{i=1}^{{\mathcal {N}}} v_{{\mathcal {N}},i}(\omega ) F(x,\xi _i). \end{aligned}$$
(3)

Here, \({\hat{f}}_{{\mathcal {N}}}(x,\omega ; S^{{\mathcal {N}}})\) denotes an approximation which is parameterized by x and \(\omega \) and depends on the dataset \( S^{{\mathcal {N}}}\). The weight functions (i.e., \(v_{{\mathcal {N}},i}(\omega )\)) suggested by [3] are calculated by some non-parametric method, such as k nearest neighbors (kNN) or kernel estimation, and depend on \(\omega \) and \(S^{\mathcal {N}}\). In other words, the data-driven collection of functions \(\{(\omega _i, F(x,\xi _i))\}_{i=1}^{{\mathcal {N}}}\) is used to construct an estimate of \(f(x,\omega )\). Thus, equation (3) should be considered as a non-parametric analog of ordinary Sample Average Approximation (SAA) used to approximate ordinary SP. In [3], the authors show that under certain assumptions (e.g., equi-continuity of the random objective function), solving (3) provides a statistically consistent estimate of an optimal solution to (2) (i.e., \(\lim \limits _{{\mathcal {N}} \rightarrow \infty } {\hat{x}}^{{\mathcal {N}}}(\omega ) = x^*(\omega ) \ w.p. 1\)). The consistency of such non-parametric estimation does not rely on specific assumptions regarding any particular distribution, and just as importantly, these approximations can be set up in a completely automated fashion, without much preprocessing. From the perspective of approximation theory this is very encouraging for estimating a decision model. In this sense, the results of [3] are analogous to the work of [34] for using ordinary SAA for ordinary SP, and both require some numerical algorithm to deliver decisions. It is important to observe that for streaming data applications, it is burdensome to recursively recompute the entire problem as new data points are observed. This handicap is corroborated by the computational results presented in this paper.
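
To make the weighted approximation in (3) concrete, the following minimal sketch (in Python, with hypothetical function names) assembles \({\hat{f}}_{{\mathcal {N}}}(x,\omega ; S^{{\mathcal {N}}})\) from a user-supplied weight rule; the kNN and kernel weight rules are reviewed in Sect. 2.

```python
def weighted_objective(x, omega, data, F, weights):
    """Non-parametric approximation (3): sum_i v_{N,i}(omega) * F(x, xi_i).

    data: observed pairs (omega_i, xi_i); F(x, xi) evaluates the cost function;
    weights(omega, [omega_1, ..., omega_N]) returns the weights v_{N,i}(omega)
    produced by a non-parametric method (e.g., kNN or kernel estimation).
    """
    v = weights(omega, [w for w, _ in data])
    return sum(v_i * F(x, xi_i) for v_i, (_, xi_i) in zip(v, data))
```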

Although non-parametric estimation methods have many applications in pattern recognition, image processing, weather forecasting and many others, their prior use in stochastic programming (decision models) has been limited to [24], where kernel-type estimators (for the unknown distribution) are used in a simple recourse model (sometimes referred to as \(L_1\) minimization in machine learning) with a fixed dataset and standard nonlinear programming techniques. In the aforementioned model, non-parametric estimation plays a static role in the sense that once the optimization model is set up, the non-parametric estimate is treated as a fixed model parameter. It is worth noting that the well-known bias of non-parametric estimation makes its algorithmic use more or less uncharted territory. Indeed, our setup, which is a simultaneous estimation and optimization approach, allows sequential updates to decisions as the data stream evolves, thus making time-constrained (or real-time) decision-making a realistic possibility. The road-blocks to solving (2) within the above context arise from three challenges:

(a) Is there an implementable algorithm which can automatically refine the non-parametric estimate as the data process evolves with increasing \({\mathcal {N}}\)?

(b) Given a dataset \(S^{ {\mathcal {N}}}\), suppose that the joint estimation-optimization step produces a decision \(x( S^{{\mathcal N}} )\). What properties should the joint estimation-optimization step impart to the sequence \(\{ x( S^{{{\mathcal {N}}}} )\}\) so that we obtain an optimal solution to (2) as \({\mathcal {N}} \rightarrow \infty \)? In other words, how should one design an optimization algorithm to ensure that the bias of non-parametric estimation does not adversely impact its asymptotic convergence?

(c) Is it possible to exploit the dynamics of the algorithm to recommend decision updates as the dataset grows? How should one balance estimation and optimization efficiency?

This paper is devoted to answering these important questions which arise from simultaneous estimation and optimization.

In this paper, we develop two algorithms for solving PSPs: one built around extensions of biased first-order methods and another based on piecewise linear value function approximations. The former may be viewed as an extension of stochastic approximation, or more accurately, of stochastic quasi-gradient methods (SQG, [15]), while the latter may be viewed as a counterpart of stochastic decomposition (SD, [28]) with non-parametric estimation. It is important to recognize that, by their very nature, non-parametric procedures typically lead to biased estimates in finite time; as a result, the combination of estimation and optimization algorithms must be designed so that it delivers an unbiased optimal decision (and value) without losing the asymptotic optimality associated with either traditional SQG or SD.

The contributions of this paper are:

1) An algorithm for solving PSP which goes “beyond using sample frequency” to solve SP problems with conditional expectation, given \({\tilde{\omega }} = \omega \). The first-order PSP algorithm that we study, referred to as LEONA (Footnote 2), allows non-parametric estimation as an integral part of a solution algorithm for PSP. The LEONA algorithm provides the possibility of rapid response to streaming data, thus extending the horizon of SP algorithms to new applications.

2) We propose a decomposition-based algorithm which combines the power of the SD algorithm with k nearest neighbors estimation (SD-kNN). To the best of our knowledge, this is the first decomposition-based algorithm that uses non-parametric estimation to attain the predictive capability required by PSP.

3) We provide a general framework to control the errors from the simultaneous estimation-optimization process. In [3], the emphasis is on the sequence of optimal solutions to each approximation, whereas our approach studies the asymptotic properties of a sequence of estimated solutions that is updated with each arrival in the streaming data process. Addressing such symbiotic structures is a major innovation of this work because it highlights the interplay of errors arising from statistical estimation and mathematical optimization. Preliminary computational evidence suggests the possibility of successful applications for rapid response in the presence of streaming data. The computational results show that both the LEONA and SD-kNN algorithms can provide near-optimal solutions for the class of problems studied in this paper. Further computational results also show that the loss of estimation quality incurred by our algorithms is minimal.

4) In our computational results, we present numerical experiments using LEONA, SD-kNN, and multi-cut Benders decomposition as a benchmark. These experiments suggest that the new algorithms are able to provide faster and more reliable decisions for streaming data, and comparisons with the standard benchmark (multi-cut Benders decomposition) reveal the potential power of simultaneous estimation and optimization. To the best of our knowledge, these experiments are the first such results in the literature.

The remainder of the paper is organized as follows. In Sect. 2, we provide a brief review on non-parametric estimation methods. In Sect. 3, we present the LEONA algorithm and its convergence results. In Sect. 4, we introduce the algorithm design of SD-kNN to solve two-stage predictive stochastic linear programming problems. Finally, we demonstrate the computational performance of both simultaneous estimation-optimization algorithms in Sect. 5.

1.1 Notations

Since we merge concepts from optimization and non-parametric estimation, we introduce notation below that serves both settings.

To be consistent with the optimization literature, we use x for the decision variable and X for the set to which x belongs (Footnote 3). Let the joint distribution of \(({\tilde{\omega }},{\tilde{\xi }})\) be \(\mu _{{\tilde{\omega }}, {\tilde{\xi }}}\). Let \(\mu _{{\tilde{\omega }}}\) and \(\mu _{{\tilde{\xi }}}\) denote the marginal distribution functions of \({\tilde{\omega }}\) and \({\tilde{\xi }}\), respectively. We let \(\xi \) and \(\omega \) denote the realizations of \({\tilde{\xi }}\) and \({\tilde{\omega }}\), respectively. Let \(g(\omega ,\xi )\) be the joint density function at \(({\tilde{\omega }},{\tilde{\xi }}) = (\omega , \xi )\), so that \(\mu _{{\tilde{\omega }}, {\tilde{\xi }}}\) has the form \(\mu _{{\tilde{\omega }}, {\tilde{\xi }}}(d\omega ,d\xi ) = g(\omega ,\xi ) d\omega d\xi , \ (\omega ,\xi ) \in {\mathcal {X}} \times {\mathcal {Y}}\). Let \(m(\omega ) = \int _{{\mathcal {Y}}} g(\omega ,\xi ) d \xi \) and suppose that the conditional distribution given \({\tilde{\omega }} = \omega \) is as follows (see [9] for details).

$$\begin{aligned} k(\xi \vert \omega ) = {\left\{ \begin{array}{ll} \frac{g( \omega , \xi )}{ m (\omega )} \quad &{}\text {if} \ m(\omega ) > 0, \\ \int _{{\mathcal {X}}} g(\omega ',\xi ) d{\omega '} \quad &{}\text {if} \ m(\omega ) = 0. \end{array}\right. } \end{aligned}$$
(4)

We use \({\mathbb {P}}\) and \({\mathbb {E}}\) to denote the probability and expectation operators, respectively. Specifically, the expectation \({\mathbb {E}}[\cdot ]\) taken in the algorithm is with respect to the Cartesian product of infinitely many probability spaces of i.i.d. samples generated by the sampling procedure. The expectation of the bias is defined with respect to the Cartesian product of finitely many probability spaces which correspond to the dataset (\({\tilde{S}}^N\)) used. We let \(N \ll {\mathcal {N}}\) and let \({\tilde{S}}^N \triangleq \{({\tilde{\omega }}_i, {\tilde{\xi }}_i) \}_{i=1}^N\) denote a set of N i.i.d. copies of \(({\tilde{\omega }},{\tilde{\xi }})\). By the convention followed in this paper, \(S^N\) denotes a realization of \({\tilde{S}}^N\). We use the Euclidean distance to measure the distance between two predictors. We let \(\Vert \cdot \Vert \) denote the Euclidean norm of a vector and the spectral norm of a matrix. Let \({\tilde{\omega }}^N_{[j]}(\omega )\) denote the \(j^{th}\) closest neighbor of \(\omega \) from a collection of random variables \(\{{\tilde{\omega }}_{i} \}_{i=1}^N\). We let \({\mathbb {I}}({\mathcal {E}})\) denote an indicator function whose value is 1 if the event \({\mathcal {E}}\) is true and 0 otherwise.

Since the true objective function in (2) is defined as a function of the decision x and the predictor \(\omega \), we define the directional derivative of this class of functions at \(x \in {\mathbb {R}}^n\) in the direction \(d \in {\mathbb {R}}^n\) for a given parameter \(\omega \in {\mathcal {X}}\).

Definition 1

Let \(\zeta (x,\omega ): S \times {\mathcal {X}} \mapsto {\mathbb {R}}\), where S is an open convex set of \({\mathbb {R}}^n\) and \({\mathcal {X}}\) is a Borel subset of \({\mathbb {R}}^{m_1 }\). For a given \(\omega \in {\mathcal {X}}\), suppose that the directional derivative of \(\zeta (\cdot ,\omega )\) at \(x \in S\) in the direction of \(d \in {\mathbb {R}}^n\) exists and it is defined below:

$$\begin{aligned} \zeta '(x,\omega ;d) = \lim _{\tau \downarrow 0} \frac{\zeta (x + \tau d,\omega ) - \zeta (x,\omega )}{\tau }. \end{aligned}$$

2 Background on non-parametric statistical estimation

Statistical regression is used to model the relationship between predictors (sometimes referred to as independent variables) and responses (sometimes referred to as dependent variables). Non-parametric estimation is a branch of statistical regression that builds such models purely from the provided sampled data, without assuming any specific pre-determined structure relating predictors and responses. One of its main advantages is that model-fitting is not required prior to the optimization process [18].

One widely used non-parametric estimation method is the k nearest neighbors (kNN) method. As the name suggests, the mean over the k nearest neighbors of the predictor (here, the word “nearest" is sensitive to the metric used to measure distance) is used as the estimate of the corresponding conditional expectation. After Fix and Hodges Jr [17] proposed the nearest-neighbor method in 1951, several articles [11, 12, 25, 51, 52] contributed to the consistency of kNN regression. Among these, Walk [51] shows that the kNN estimate converges to its true conditional expectation almost surely under certain regularity conditions. We give the definition of the kNN estimate of \(f(x,\omega )\) below.

Definition 2

Let \({\mathcal {S}}(k_N,\omega ; \{\omega _j \}_{j=1}^N)\) be the \(k_N\) nearest neighbors of \(\omega \) from the sample set \(\{\omega _j \}_{j=1}^N\). The kNN estimate of \(f(x,\omega )\) is defined as follows.

$$\begin{aligned} {\hat{f}}_{k_N,N}(x, \omega ;S^N) \triangleq \frac{1}{k_N} \sum _{i=1}^N {\mathbb {I}}\left( \omega _i \in {\mathcal {S}}(k_N,\omega ; \{\omega _j \}_{j=1}^N) \right) F(x,\xi _i). \end{aligned}$$

Note that \({\hat{f}}_{k_N,N}(x, \omega ;S^N)\) is a measurable function of \(x, \ \omega \) and the data \(S^N\) (see [25] for more details). We provide the formal statement of asymptotic convergence of the kNN estimate of the objective function in (2) below.

Theorem 2.1

([51, Theorem 1]) Let \({\tilde{S}}^N = \{({\tilde{\omega }}_i,{\tilde{\xi }}_i)\}_{i=1}^N\) be independently and identically distributed (i.i.d.) random variables. Let \(x \in X\) and suppose that \({\mathbb {E}}[\vert F(x,{\tilde{\xi }}) \vert ] < \infty \). Let \(k_N\) be monotone increasing with N, \(\lim _{N \rightarrow \infty } k_N = \infty \), \(\lim _{N \rightarrow \infty } \frac{k_N}{N} = 0\), and let \(k_N\) vary regularly with exponent \(\beta \in (0,1]\) (e.g., \(k_N = \lfloor N^{\beta } \rfloor \), \(\beta \in (0,1)\)). Ties are broken randomly or by index. Then the following holds:

$$\begin{aligned} \lim _{N \rightarrow \infty } {\hat{f}}_{k_N,N}(x,\omega ; {\tilde{S}}^N) = f(x,\omega ), \quad w.p.1. \end{aligned}$$
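
To make Definition 2 concrete, here is a minimal Python sketch of the kNN estimate; the interface (the dataset layout and the cost function F) is an assumption for illustration, and ties are broken by index via a stable sort, consistent with Theorem 2.1.

```python
import numpy as np

def knn_estimate(x, omega, data, F, beta=0.5):
    """kNN estimate of f(x, omega) from Definition 2 (illustrative sketch).

    data: list of pairs (omega_i, xi_i); F(x, xi) evaluates the cost function;
    beta: exponent for k_N = floor(N**beta), as allowed by Theorem 2.1.
    """
    N = len(data)
    k_N = max(1, int(np.floor(N ** beta)))
    # Euclidean distances from the query predictor omega to each sampled predictor
    dists = np.array([np.linalg.norm(np.asarray(w_i) - np.asarray(omega))
                      for w_i, _ in data])
    # Stable argsort breaks ties by index
    nn_idx = np.argsort(dists, kind="stable")[:k_N]
    # Average the cost over the k_N nearest responses
    return float(np.mean([F(x, data[i][1]) for i in nn_idx]))
```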

Another popular class of non-parametric regression methods falls under the category of kernel methods. Specifically, some predefined kernel function is used to estimate the marginal density function. Since Nadaraya [39] and Watson [54] proposed the famous Nadaraya-Watson kernel estimate in 1964, it has been used in biostatistics [8], economics [6], image processing [49], etc. Several papers [11, 22, 23, 25, 35, 52] further refined the theory of kernel estimation. Let \({\mathcal {K}}^{kernel}\) denote the set of the following kernel functions:

1. Naïve kernel: \(K^{{\mathcal {N}}}(z) \triangleq {\mathbb {I}}_{\{\Vert z\Vert \le 1\}}\).

2. Epanechnikov kernel: \(K^{{\mathcal {E}}}(z) \triangleq (1 - \Vert z\Vert ^2){\mathbb {I}}_{\{\Vert z\Vert \le 1\}}\).

3. Quartic kernel: \(K^{{\mathcal {Q}}}(z) \triangleq (1 - \Vert z\Vert ^2)^2{\mathbb {I}}_{\{\Vert z\Vert \le 1\}}\).

4. Gaussian kernel: \(K^{{\mathcal {G}}}(z) \triangleq e^{-\frac{\Vert z\Vert ^2}{2}}\).

Definition 3

Let \(K \in {\mathcal {K}}^{kernel}\). The kernel estimation of \(f(x,\omega )\) is defined as follows.

$$\begin{aligned} {\hat{f}}^{\textrm{kernel}}_N(x,\omega ; S^N) \triangleq \frac{\sum _{i=1}^N F(x,\xi _i) K(\frac{\omega - \omega _i}{h_N})}{\sum _{i=1}^N K(\frac{\omega - \omega _i}{h_N})}, \end{aligned}$$
(5)

where \(h_N > 0\) is the bandwidth. Also note that \({\hat{f}}^{\textrm{kernel}}_N(x,\omega ; S^N)\) is a measurable function of x, \(\omega \), and the data \(S^N\). The strong pointwise consistency property of the Nadaraya-Watson kernel regression estimate is summarized below.

Theorem 2.2

([52, Theorem 3]) Let \(\{({\tilde{\omega }}_i,{\tilde{\xi }}_i)\}_{i=1}^N\) be identically and independently distributed (i.i.d.) random variables. Let the bandwidth be denoted \(h_N > 0\) and \(h_N= C N^{-\beta }, \ {C > 0}, \ \beta \in (0,\frac{1}{m_1})\) (i.e., \(m_1\) is the dimension of \(\omega \)), for all \(N \in {\mathbb {N}}\). Let \(K \in {\mathcal {K}}^{\textrm{kernel}}\). Let \(x \in X\) and suppose that \({\mathbb {E}}\left[ \vert F(x,{\tilde{\xi }}) \vert \max \{\log \vert F(x,{\tilde{\xi }})\vert ,0 \} \right] < \infty \). If the kernel is used to calculate \(v_{N,i}(\omega )\), then the following holds:

$$\begin{aligned} \lim _{N \rightarrow \infty } {\hat{f}}^{\textrm{kernel}}_{N}(x,\omega ; {\tilde{S}}^N) = f(x,\omega ), \quad w.p.1. \end{aligned}$$
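
As a companion to Definition 3 and Theorem 2.2, the sketch below (with assumed inputs) computes the Nadaraya-Watson estimate with the Gaussian kernel and the bandwidth rule \(h_N = C N^{-\beta }\); it also adopts the \(0/0 = 0\) convention used later in Sect. 3.5.

```python
import numpy as np

def nw_kernel_estimate(x, omega, data, F, C=1.0, beta=0.2,
                       kernel=lambda z: np.exp(-0.5 * z ** 2)):
    """Nadaraya-Watson estimate of f(x, omega) from Definition 3 (sketch).

    Uses the Gaussian kernel by default and bandwidth h_N = C * N**(-beta),
    with beta in (0, 1/m_1) as required by Theorem 2.2.
    """
    N = len(data)
    h_N = C * N ** (-beta)
    # Kernel weights K((omega - omega_i) / h_N), evaluated through the Euclidean norm
    weights = np.array([kernel(np.linalg.norm(np.asarray(w_i) - np.asarray(omega)) / h_N)
                        for w_i, _ in data])
    total = weights.sum()
    if total == 0.0:               # 0/0 := 0 convention for compactly supported kernels
        return 0.0
    values = np.array([F(x, xi_i) for _, xi_i in data])
    return float(weights @ values / total)
```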

Remark

In [25, Theorem 6.2], Györfi et al. study the convergence rate of kNN estimation (i.e., of \({\mathbb {E}}_{{\tilde{S}}^N}[\int _{{\mathcal {X}}} ( m^{k_N}_N(\omega ;{\tilde{S}}^N) - m(\omega ) )^2 \mu _\omega (d\omega )]\)), where \(m^{k_N}_N(\omega ;{\tilde{S}}^N)\) is the kNN estimate of \(m(\omega )\). The same comment applies to the kernel estimate (see [25, Theorem 5.2]). We anticipate that one may be able to obtain refined convergence rates (see [5] for details) by studying the difference \(\sup \limits _{x \in X} \vert {\hat{f}}_{k_N,N}(x,\omega ; {\tilde{S}}^N) - f(x,\omega ) \vert \), but this is beyond the scope of the current paper.

3 Learning enabled optimization with non-parametric approximations

Hannah et al. [26] include non-parametric estimation in the context of function-value and gradient-value estimation. However, their focus is on an empirical investigation of non-parametric estimation rather than on a statistical and mathematical framework supporting asymptotic results. In the context of non-parametric sample average approximation, the asymptotic results of Bertsimas and Kallus [3] resolved the open question of statistical consistency of the model. Our paper goes a step further than [3] by incorporating simultaneous estimation and optimization in the presence of streaming data.

3.1 Mathematical formulation

In this subsection, we formally set up the mathematical foundations for LEONA. We let \(G(x,\xi )\) denote a subgradient of \(F(x,\xi )\). Given \(\omega \) as an observation of \({\tilde{\omega }}\), we introduce the following assumptions to ensure the asymptotic properties of the LEONA algorithm.

(A0):

(Non-negativity of the weight function) Let \(v_{N,i}: {\mathbb {R}}^{m_1} \mapsto {\mathbb {R}}\). The weight function is non-negative for all i (i.e., \(v_{N,i}(\omega ) \ge 0\)) and \(\sum _{i=1}^N v_{N,i}(\omega ) \le 1\).

(A1):

(Probabilistic existence of \({\tilde{\omega }} = \omega \)) The joint distribution function \(\mu _{{\tilde{\omega }}, {\tilde{\xi }}}\) exists and the conditional distribution is defined by (4).

(A2):

(i.i.d assumption on data pairs) \(({\tilde{\omega }}_1,{\tilde{\xi }}_1), ({\tilde{\omega }}_2,{\tilde{\xi }}_2), \ldots \) are independent and identically distributed random variables.

(A3):

(Compact and convex feasible region) \(X \subset {\mathbb {R}}^n\) is a deterministic compact convex nonempty set.

(A4):

(Convex objective function and related boundedness properties) Let \(F:X \times {\mathcal {Y}} \mapsto {\mathbb {R}}\). For every \(\xi \in {\mathcal {Y}} \subset {\mathbb {R}}^{m_2}\), \(F(\cdot ,\xi )\) is convex on X, and its subdifferential is bounded on X (i.e., there exists \(M_G \in (0, \infty )\) such that \(\Vert G(x,\xi )\Vert \le M_G\)). Additionally, there exists \(M_F \in (0, \infty )\) such that \(\vert F(x,\xi ) \vert \le M_F\) for all \(x \in X\) and \(\xi \in {\mathcal {Y}}\).

(A5):

(Lipschitz continuity of the cost function) For any \(\xi \in {\mathcal {Y}}\), \(F(\cdot ,\xi )\) is Lipschitz continuous with a uniform constant \(L_F\) on X (i.e., \(\vert F(x,\xi ) - F(x',\xi ) \vert \le L_F \Vert x - x' \Vert \) for all \(x, x' \in X\)).

(A6):

Suppose that there exists \(\delta _N(\cdot ,\cdot ) \ge 0\) such that \(\vert {\mathbb {E}}[{\hat{f}}_{N}(x,\omega ; {\tilde{S}}^N)] - f(x,\omega ) \vert \le \delta _{N}(x,\omega )\) for all \(x \in X\) and \(\delta _N(\cdot ,\omega )\) converges uniformly to 0 on X, as \(N \rightarrow \infty \).

Note that assumption A0 automatically holds for both kNN and kernel estimation. Assumption A6 requires that the bias from the non-parametric estimation converge to 0 uniformly on the feasible region, which will be shown to hold for both kNN and kernel estimation under assumptions A1–A5. Note that assumption A6 does not impose a finite-sample bound on the estimation error.

We begin by stating a result from [13], which compares the conditional expectations of two ordered random variables.

Lemma 3.1

Let \((\Omega , \Sigma _\Omega , {\mathbb {P}})\) be a probability space. Let \(\Sigma _0 \subset \Sigma _\Omega \) be a sub-\(\sigma \)-algebra and let \(Y_1,Y_2\) be \(\Sigma _\Omega \)-measurable random variables. Suppose that \({\mathbb {E}}[\vert Y_1 \vert ] < \infty \) and \({\mathbb {E}}[\vert Y_2 \vert ] < \infty \). Then \(Y_1 \le Y_2\) implies \({\mathbb {E}}[Y_1 \vert \Sigma _0] \le {\mathbb {E}}[Y_2 \vert \Sigma _0] \ \text {w.p.1}\).

Next, Lemma 3.2 states that the true objective function in (2) and its non-parametric estimate in (3) are convex under the assumptions given above.

Lemma 3.2

(Convexity of objective functions) Suppose assumptions A0–A4 are satisfied. Then the following hold:

(1) \(f(\cdot ,\omega )\) is convex on X, for \(\mu _{{\tilde{\omega }}}\) almost every \(\omega \in {\mathcal {X}}\).

(2) \({\hat{f}}_N(\cdot ,\omega ;S^N)\) is convex on X.

Lemmas 3.1 and 3.2 establish the convex structure of the model, which, at a conceptual level, makes it amenable to large-scale optimization. One consequence of convexity is that at points where the value is finite, the standard subdifferential of convex analysis exists. Ermoliev and Gaivoronski [15, 16] define a class of stochastic subgradients with a particular bias property for classical SP (i.e., without covariates).

Definition 4

[15] Let \(\hat{{\mathcal {G}}}_N(x; \{\xi _i\}_{i=1}^N)\) be a measurable function of x and \(\{\xi _i \}_{i=1}^N\). Furthermore, let \(x^* \in \arg \min \limits _{x \in X} f(x) \triangleq {\mathbb {E}}_{{\tilde{\xi }}}[F(x,{\tilde{\xi }})]\). Then \(\hat{{\mathcal {G}}}_N(x;\{{\tilde{\xi }}_i\}_{i=1}^N)\) is a stochastic quasi-gradient of f(x) if it satisfies the following condition:

$$\begin{aligned} f(x^*) - f(x) \ge (x^* - x)^\top {\mathbb {E}}[\hat{{\mathcal {G}}}_N(x;\{{\tilde{\xi }}_i\}_{i=1}^N)] + \tau _N, \ \text {where} \ \lim _{N \rightarrow \infty } \tau _N = 0. \end{aligned}$$
(6)

We extend the notion of stochastic quasi-gradient to the case of SP with covariates. As a result, our approach opens a whole new area of applying non-parametric estimation for PSP.

Definition 5

Let \(\hat{{\mathcal {G}}}_N(x,\omega ; S^N)\) be a measurable function of x, \(\omega \), and \(S^N\). Furthermore, let \(x^*\) be an optimal solution of (2). Then \(\hat{{\mathcal {G}}}_N(x,\omega ;{\tilde{S}}^N)\) is a stochastic quasi-gradient of \(f(x,\omega )\) if it satisfies the following condition:

$$\begin{aligned} f(x^*,\omega ) - f(x,\omega ) \ge (x^* - x)^\top {\mathbb {E}}[ \hat{{\mathcal {G}}}_N(x,\omega ;{\tilde{S}}^N)] + \tau _N, \ \text {with} \ \lim _{N \rightarrow \infty } \tau _N = 0. \end{aligned}$$
(7)

Unlike Monte Carlo estimates of subgradients, estimates based on stochastic quasi-gradients are biased; moreover, these estimates are not \(\epsilon \)-subgradients in the standard sense because the errors are not required to be positive. Recall that we let \(G(x,\xi )\) denote a subgradient of \(F(x,\xi )\). Using the scaling and addition properties of subdifferentials, we define the Non-parametric Stochastic Quasi-gradient (NSQG) below.

Definition 6

Let \({\hat{G}}_N(x,\omega ; S^N)\) be a measurable function of x, \(\omega \), and data \(S^N\). \({\hat{G}}_N(x,\omega ; S^N)\) is a Non-parametric Stochastic Quasi-gradient (NSQG) of \(f(x,\omega )\) in (2), if it can be written as the weighted sum below,

$$\begin{aligned} {\hat{G}}_N(x,\omega ; S^N) \triangleq \sum _{i=1}^N v_{N,i}(\omega ) G(x,\xi _i), \end{aligned}$$
(8)

where \(G(x,\xi _i)\) is the subgradient of \(F(x,\xi _i)\) and the weight \(v_{N,i}(\omega )\) is calculated via non-parametric estimation methods.

The notion of NSQG in Definition 6 is the key component of the LEONA algorithm and distinguishes it from other methods. According to Altman [1], non-parametric estimators (e.g., kNN and kernel estimators) are often biased. However, under mild assumptions, the following analysis shows that the bias of both the kNN and kernel estimators of the true subgradients decreases to zero as the sample size increases to infinity. For now, we focus on the general case in which the estimation bias converges to 0 uniformly on X and show that this condition guarantees that \({\hat{G}}_N(x,\omega ; {\tilde{S}}^N)\) is a stochastic quasi-gradient of \(f(x,\omega )\).
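
As an illustration of Definition 6, a minimal sketch (with hypothetical interfaces for G and the weight rule) of assembling an NSQG from per-scenario subgradients is given below.

```python
import numpy as np

def nsqg(x, omega, data, G, weights):
    """NSQG (8): hat G_N(x, omega; S^N) = sum_i v_{N,i}(omega) * G(x, xi_i) (sketch).

    G(x, xi) returns a subgradient of F(., xi) at x; weights(omega, predictors)
    returns the non-parametric weights v_{N,i}(omega) (kNN- or kernel-based).
    """
    v = weights(omega, [w for w, _ in data])
    return np.sum([v_i * np.asarray(G(x, xi_i), dtype=float)
                   for v_i, (_, xi_i) in zip(v, data)], axis=0)
```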

Lemma 3.3

Let \({\hat{f}}_N(x,\omega ;{\tilde{S}}^N)\) be as defined in (3). Suppose assumptions A0–A4 and A6 are satisfied. Then \({\hat{G}}_N(x,\omega ;{\tilde{S}}^N)\) is a stochastic quasi-gradient of \(f(x,\omega )\) in the sense of Definition 5.

Proof

See Appendix A.1. \(\square \)

3.2 Algorithm design

Now we introduce a subroutine of the LEONA algorithm, which we refer to as the NSQG Update. In formulating the joint estimation-optimization process, we adopt a k nearest neighbors (kNN)/kernel approach for subgradient estimation and a stochastic quasi-gradient method for optimization ([16]). This process can be understood as a simultaneous learning approach that uses statistical learning to improve decision-making within a first-order algorithm. Given the current estimated decision \(x_l\), step size \(\gamma _l\), and sample size \(N_l\), the key feature of the NSQG Update is that it generates an NSQG at \(x_l\), \({\hat{G}}_{N_l}(x_l,\omega ;S^{N_l}_l)\), to update the decision. The associated update formula is given as follows:

$$\begin{aligned} x_{l+1} = \Pi _X (x_l - \gamma _l {\hat{G}}_{N_l}(x_l,\omega ;S^{N_l}_l)) \end{aligned}$$
(9)

which is equivalent to solving a proximal mapping problem:

$$\begin{aligned} x_{l+1} = \arg \min _{x \in X} \{x^\top {\hat{G}}_{N_l}(x_l,\omega ;S^{N_l}_l) + \frac{1}{2 \gamma _l} \Vert x - x_l\Vert ^2 \} \end{aligned}$$
(10)
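
A minimal sketch of one NSQG Update is given below; the projection oracle project_X is an assumption (e.g., clipping for box constraints), and nsqg refers to the sketch given after Definition 6.

```python
import numpy as np

def nsqg_update(x_l, gamma_l, omega, data_l, G, weights, project_X):
    """One NSQG Update (9): x_{l+1} = Pi_X(x_l - gamma_l * hat G) (sketch)."""
    G_hat = nsqg(x_l, omega, data_l, G, weights)      # NSQG at the current iterate
    return project_X(np.asarray(x_l, dtype=float) - gamma_l * G_hat)
```
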
Algorithm 1: NSQG Update (pseudocode figure)

Algorithm 2: LEONA (pseudocode figure)

Based on the data-generating process in the NSQG Update, the dataset \({\tilde{S}}^{N_l}_l\) is independent of \({\tilde{S}}^{N_j}_j\) for any \(j \ne l\).

Inspired by Polyak’s averaging approach in stochastic approximation [40, 44], we use the proposed NSQG Update (Algorithm 1) to develop a first-order method with guaranteed asymptotic convergence, which we refer to as the LEONA algorithm. It is worth noting that since the response is assumed to be parametrized by the predictor, the convergence of the proposed algorithm holds for every possible realization of the predictor. Let \(x_0 \in X\) denote the initial estimated solution. Let n, m be pre-specified positive integers (e.g., \(n = 1, m = 2\)). Here, \(n-1\) NSQG calls are used for warm starting, generating \(x_1,\ldots , x_n\); m is the increment of the moving-window size. In other words, the parameter n determines the number of NSQG updates before the moving average is initialized, while the parameter m determines how fast the size of the moving window increases. We further let \(i_0 = n\) and \(j_0 = n - 1\), which will be used to determine which portion of the iterates enters the decision averaging. At the \(q^{\textrm{th}}\) outer iteration, the algorithm computes a constant step size,

$$\begin{aligned} {\tilde{\gamma }}_q = \frac{2D_X}{M_G \sqrt{mq}}, \end{aligned}$$

where \(D_X = \max \limits _{x \in X} \Vert x - x_0\Vert \) and \(M_G\) is the upper bound on the subgradient (see assumption A4). The window size (i.e., the number of NSQG Update calls) of the current inner loop is mq, which increases as q increases. It then calculates a starting index, \(i_q = i_{q-1} + m(q - 1)\), and an ending index, \(j_q = j_{q - 1} + mq\), which are consistent with the indices of the previously generated iterates. Starting from \(x_{i_q - 1}\), we call the NSQG Update mq times to generate \(x_{i_q}, x_{i_q + 1}, \ldots , x_{j_q}\). Then we take the average of \(\{x_{i_q}, x_{i_q + 1}, \ldots , x_{j_q}\}\) to obtain the estimated decision in the \(q^{th}\) outer loop:

$$\begin{aligned} {\tilde{x}}_q = \frac{1}{mq}\sum _{l=i_q}^{j_q} x_l. \end{aligned}$$

The entire process of the LEONA algorithm is summarized in Algorithm 2. When \(M_G\) and \(D_X\) are unknown, one can choose a step size \({\tilde{\gamma }}_q = \frac{C}{\sqrt{mq}}\), where C is a finite positive constant (i.e., \(C \in (0, \infty )\)). In other words, as long as we choose a constant step size (within each moving window) that is of the order of the inverse of the square root of the moving-window size (i.e., \(O(\frac{1}{\sqrt{mq}})\)), the asymptotic convergence of the LEONA algorithm is assured. This case is discussed in Sect. 3.3.
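
The bookkeeping of the indices \(i_q, j_q\) and of the moving-window average is illustrated in the sketch below; the sampling and update interfaces (sample_batch, update_fn) are hypothetical, and the fallback step size \(\frac{C}{\sqrt{mq}}\) is used.

```python
import numpy as np

def leona(x0, sample_batch, update_fn, n=1, m=2, C=1.0, num_outer=50):
    """Sketch of the LEONA outer loop (Algorithm 2) with moving-window averaging.

    sample_batch(N) is assumed to draw a fresh i.i.d. dataset of size N;
    update_fn(x, gamma, data) performs one NSQG Update (9) with omega, G,
    the weight rule, and the projection oracle already bound.
    """
    iterates = [np.asarray(x0, dtype=float)]          # iterates[l] stores x_l
    for l in range(1, n):                             # n - 1 warm-start NSQG calls
        iterates.append(update_fn(iterates[-1], C, sample_batch(l)))
    i_q, j_q = n, n - 1                               # i_0 = n, j_0 = n - 1
    x_tilde = iterates[-1]
    for q in range(1, num_outer + 1):
        gamma_q = C / np.sqrt(m * q)                  # constant step within the window
        i_q, j_q = i_q + m * (q - 1), j_q + m * q
        for _ in range(m * q):                        # mq NSQG Updates in this window
            data = sample_batch(len(iterates))        # sample size grows with the iteration count
            iterates.append(update_fn(iterates[-1], gamma_q, data))
        x_tilde = np.mean(iterates[i_q:j_q + 1], axis=0)   # tilde x_q = window average
    return x_tilde
```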

Remark

The motivation for using the moving averaging window is as follows. By increasing the size of the averaging window (i.e., increasing q in mq), we generate a sequence of averaged estimated solutions \(\{{\tilde{x}}_q\}\), which is shown to have an optimality gap that converges to 0.

3.3 Convergence analysis

There are two meta-schemes in the LEONA algorithm: one is function estimation, and the other is solution approximation. We do not solve the non-parametric SAA to obtain the estimated solution. Instead, we use a first-order approximation of the objective function in the non-parametric SAA to improve the estimated solution. This process evolves as new samples are encountered. As a result, both the sequence of approximations and the sequence of estimated solutions (based on such approximations) need to be shown to converge. Through the end of this section, we let l denote the \(l^{th}\) iteration of the NSQG Update, and let \(x_l\) and \(x_{l+1}\) denote two successive iterates generated by the NSQG Update. Let \(\{{\tilde{x}}_q\}_q\) denote the sequence of decisions generated by LEONA, where q denotes the outer iteration number.

In this subsection, we first develop a contraction property of the NSQG Update and then prove the convergence of the LEONA algorithm in expectation.

Lemma 3.4

(Contraction in NSQG Update) Suppose that assumptions A0–A4 and A6 are satisfied. Then, for any \(x' \in X\) the following inequality holds for all l:

$$\begin{aligned} \begin{aligned} {\mathbb {E}}[\Vert x_{l+1} - x'\Vert ^2] \le \ {}&{\mathbb {E}}[\Vert x_l - x'\Vert ^2] - 2 \gamma _l {\mathbb {E}}[f(x_l,\omega ) - f(x',\omega )] + \gamma _l^2 M_G^2\\&+ 2 \gamma _l {\mathbb {E}}[\delta _{N_l}(x_l, \omega ) + \delta _{N_l}(x',\omega )] \end{aligned}. \end{aligned}$$

Proof

See Appendix A.2. \(\square \)

There are two types of estimated-solution updates in the LEONA algorithm. Notwithstanding the bias associated with stochastic quasi-gradients, LEONA assures asymptotic convergence by extending the averaging process to an ever-increasing window size. In the following, we let \(x^* \in \arg \min \limits _{x \in X} f(x,\omega )\) be as specified in (2) and \(c^*(\omega ) = \min \limits _{x \in X} f(x,\omega )\).

Lemma 3.5

Let \(\omega \) be fixed. Suppose assumptions A0–A4 and A6 are satisfied. Let \(A_l = \sup \{\delta _{N_l}(x,\omega ): x \in X \}\). Then the following holds:

$$\begin{aligned} {\mathbb {E}}[f({\tilde{x}}_q,\omega ) - c^*(\omega )] \le \frac{2}{mq} \sum _{l=i_q}^{j_q} A_l + \frac{2 D_X M_G }{\sqrt{mq}}, \end{aligned}$$

where both \(i_q\) and \(j_q\) are defined in Algorithm 2.

Proof

See Appendix A.3. \(\square \)

The systematic error of the estimated solution in the \(q^{th}\) outer loop of the algorithm consists of the error resulting from the step size, which is \(\frac{2 D_X M_G }{\sqrt{mq}}\), and the bias from non-parametric estimation, which is \(\frac{2}{mq} \sum _{l=i_q}^{j_q} A_l\). We now state the convergence of the LEONA algorithm in the following theorem.

Theorem 3.6

(Asymptotic convergence of LEONA) Let \(\omega \) be fixed. Suppose assumptions A0–A4 and A6 are satisfied. Then the following holds:

$$\begin{aligned} \lim _{q \rightarrow \infty } {\mathbb {E}}[f({\tilde{x}}_q,\omega ) - c^*(\omega )] = 0. \end{aligned}$$

Proof

By Lemma 3.5, it suffices to show that both \(\frac{2}{mq} \sum _{l=i_q}^{j_q} A_l\) and \(\frac{2 D_X M_G }{\sqrt{mq}}\) converge to 0 as q goes to infinity. It is obvious that \( \lim \limits _{q \rightarrow \infty } \frac{2D_X M_G }{\sqrt{mq}} = 0\). By assumption A6, we have \( \lim \limits _{l \rightarrow \infty } A_l = 0\). We let \(A_{l_q} = \max _{i_q \le l \le j_q} \{ A_l\}\), and hence \(0 \le \frac{2}{mq} \sum _{l=i_q}^{j_q} A_l \le 2 A_{l_q}\). Since \(\{ A_{l_q} \}\) is a subsequence of \(\{ A_l \}\) and \(\lim \limits _{l \rightarrow \infty } A_l = 0\), it follows that \(\lim \limits _{q \rightarrow \infty } A_{l_q} = 0\). Hence, \(\lim \limits _{q \rightarrow \infty } \frac{2}{mq}\sum _{l=i_q}^{j_q} A_l = 0\). Therefore, we have \( \lim \limits _{q \rightarrow \infty } {\mathbb {E}}[f({\tilde{x}}_q,\omega ) - c^*(\omega )] = 0. \) \(\square \)

Remark

The analysis within each moving window has the same flavor as that in [40]. However, the design of the moving window is the key to ensuring that the non-parametric estimation bias diminishes to 0. In the case when \(M_G\) and \(D_X\) are unknown, we can pick the step size \(\frac{C}{\sqrt{mq}}\), where C is a positive constant (i.e., \(C \in (0, \infty )\)). In this case, we have \( {\mathbb {E}}[f({\tilde{x}}_q,\omega ) - c^*(\omega )] \le \frac{2}{mq} \sum _{l=i_q}^{j_q} A_l + \frac{2 D^2_X + \frac{1}{2} C^2 M_G^2 }{C \sqrt{mq}}. \) Again, since \(D_X\), \(M_G\), m and C are finite constants, we have

$$\begin{aligned} \lim _{q \rightarrow \infty } \frac{2 D^2_X + \frac{1}{2} C^2 M_G^2 }{C \sqrt{mq}} = \lim _{q \rightarrow \infty } q^{-\frac{1}{2}} \left[ \frac{2 D^2_X + \frac{1}{2} C^2 M_G^2 }{C \sqrt{m}} \right] = 0. \end{aligned}$$

This implies that \(\lim \limits _{q \rightarrow \infty } {\mathbb {E}}[f({\tilde{x}}_q,\omega ) - c^*(\omega )] = 0\). Although this step size rule is not optimal, the asymptotic convergence of LEONA is still valid.

3.4 LEONA with kNN estimation

We begin by providing the kNN realization of NSQG and then analyze the impact of the bias on the kNN estimate. We will show that LEONA with kNN estimation (LEONA-kNN) satisfies the conditions in Theorem 3.6.

Given a positive integer \(k_N\), a dataset \(S^N\), and an observation of the predictor \(\omega \), we aim to find the \(k_N\) covariates from this dataset closest (in Euclidean distance) to \(\omega \). We require that ties are broken by index (see the discussion of tie-breaking by Walk [51]) and that \(k_N\) is smaller than N. If \(\omega _i\) belongs to the set of \(k_N\) nearest neighbors of \(\omega \), then \(v_{N,i}(\omega )\) is set to \(\frac{1}{k_N}\); otherwise, \(v_{N,i}(\omega )\) is set to 0. In other words, \(v_{N,i}(\omega ) = \frac{1}{k_N}{\mathbb {I}}\left( \omega _i \in {\mathcal {S}}(k_N,\omega ; \{\omega _j \}_{j=1}^N) \right) \) for \(i = 1,2,\ldots ,N\).

Corollary 3.7

(Asymptotic convergence of LEONA-kNN) Let \(\omega \) be fixed. Let \(v_{N,i}(\omega ) = \frac{1}{k_N}{\mathbb {I}}(\omega _i \in {\mathcal {S}}(k_N,\omega ;\{\omega _j\}_{j=1}^N))\) and \(k_N = \lfloor N^\beta \rfloor \), \(\beta \in (0,1)\). Let \(\{{\tilde{x}}_q\}_q\) be generated by LEONA-kNN. Suppose assumptions A1–A5 are satisfied. Then \(\lim \limits _{q \rightarrow \infty } {\mathbb {E}}[f({\tilde{x}}_q, \omega ) - c^*(\omega )] = 0\).

Proof

Note that assumption A0 is automatically satisfied by the weight-function calculation of kNN estimation. By assumptions A4–A5, it is easy to verify that \(\{{\mathbb {E}}[{\hat{f}}_{k_N,N}(\cdot ,\omega ;{\tilde{S}}^N)] \}_{N}\) is Lipschitz continuous with a common Lipschitz constant, \(L_F\), which further implies equicontinuity. For any \(x \in X\), by Theorem 2.1, we have \( \lim \limits _{N \rightarrow \infty } {\hat{f}}_{k_N,N}(x,\omega ;{\tilde{S}}^N) = f(x,\omega ) \ w.p. 1.\) Then, by the Bounded Convergence Theorem, we have

$$\begin{aligned} \lim _{N \rightarrow \infty } {\mathbb {E}}[{\hat{f}}_{k_N,N}(x,\omega ;{\tilde{S}}^N)] = f(x,\omega ). \end{aligned}$$
(11)

Combining (11) with the equicontinuity of \(\{{\mathbb {E}}[{\hat{f}}_{k_N,N}(\cdot ,\omega ;{\tilde{S}}^N)] \}_{N}\), by [45, Theorem 7.25] we conclude that \( \lim \limits _{N \rightarrow \infty } \vert {\mathbb {E}}[{\hat{f}}_{k_N,N}(x,\omega ;{\tilde{S}}^N)] - f(x, \omega ) \vert = 0 \ \text {uniformly on } X. \) Finally, by Theorem 3.6, we get the desired result. \(\square \)

3.5 LEONA with kernel estimation

In this subsection, we introduce another variant of the LEONA algorithm, which utilizes the Nadaraya-Watson kernel estimator to estimate the subgradient of \(f(x,\omega )\). Recall that the kernel function is denoted K; in the literature on kernel methods, \(h_N\) is known as the bandwidth. We refer to this variant as LEONA-kernel. To ensure pointwise convergence of the kernel estimate, Theorem 2.2 suggests that \(h_N\) needs to satisfy \(h_N= C N^{-\beta }, \ \beta \in (0,\frac{1}{m_1})\) (recall that \(m_1\) is the dimension of the predictor \({\tilde{\omega }}\)). The non-parametric stochastic quasi-gradient of \(f(x,\omega )\) based on the Nadaraya-Watson kernel estimator, which we refer to as the kernel estimate of NSQG, is written as follows.

$$\begin{aligned} {\hat{G}}^{\textrm{kernel}}_{N}(x,\omega ;S^N) \triangleq \frac{\sum _{i=1}^{N} G(x,\xi _i) K(\frac{\omega - \omega _i}{h_N})}{\sum _{i=1}^{N} K(\frac{\omega - \omega _i}{h_N})}. \end{aligned}$$
(12)

The benchmark function of the kernel estimate of NSQG is provided below.

$$\begin{aligned} {\hat{f}}^{\textrm{kernel}}_N(x,\omega ;S^N) \triangleq \frac{\sum _{i=1}^N F(x,\xi _i) K(\frac{\omega - \omega _i}{h_N})}{\sum _{i=1}^N K(\frac{\omega - \omega _i}{h_N})}. \end{aligned}$$
(13)

In the analysis of kernel estimates, we define \(\frac{0}{0} = 0\) (see [52]), because it is possible that \(\sum _{i=1}^N K(\frac{\omega - \omega _i}{h_N}) = 0\) when a kernel with compact support (e.g., the Naïve kernel) is used. With this convention, for any \(K \in {\mathcal {K}}^{kernel}\), we have

$$\begin{aligned} \frac{K(\frac{\omega - \omega _i}{h_N})}{\sum _{j=1}^N K(\frac{\omega - \omega _j}{h_N})} \ge 0 \ \forall \ i, \ \text {and} \ \sum _{i=1}^N \frac{K(\frac{\omega - \omega _i}{h_N})}{\sum _{j=1}^N K(\frac{\omega - \omega _j}{h_N})} \le 1. \end{aligned}$$
(14)

Therefore, equation (14) implies that the requirement in assumption A0 is satisfied.

Corollary 3.8

(Asymptotic convergence of LEONA-kernel) Let \(K \in {\mathcal {K}}^{kernel}\) and let \(\omega \) be fixed. Let \(v_{N,i}(\omega ) = \frac{K(\frac{\omega - \omega _i}{h_N})}{\sum _{j=1}^N K(\frac{\omega - \omega _j}{h_N})}\), where the bandwidth \(h_N\) satisfies \(h_N= C N^{-\beta }, \ C > 0, \ \beta \in (0,\frac{1}{m_1})\). Let \(\{{\tilde{x}}_q\}_q\) be generated by LEONA-kernel. Suppose assumptions A1–A5 are satisfied. Further suppose that \({\mathbb {E}}\left[ \vert F(x,{\tilde{\xi }}) \vert \max \{\log \vert F(x,{\tilde{\xi }}) \vert ,0 \} \right] < \infty \) for all \(x \in X\). Then the following holds:

$$\begin{aligned} \lim _{q \rightarrow \infty } {\mathbb {E}}[f({\tilde{x}}_q,\omega ) - c^*(\omega )] = 0. \end{aligned}$$

Proof

The proof is similar to that of Corollary 3.7; the only difference is that Theorem 2.2 is used for the pointwise convergence of the kernel estimate. \(\square \)

4 Non-parametric stochastic decomposition

We begin by noting that traditional cutting-plane methods (e.g., Benders) rely on valid supporting hyperplanes of the true objective function. In the presence of streaming data, the cuts computed in any iteration are not only biased at a given point but may in fact cut away portions of the asymptotic objective function, and Benders-type methods may never recover without additional modifications. However, SD-type algorithms ([27, 28, 47]) avoid such issues by performing a local piecewise linear approximation of the objective function followed by the necessary cut updates. For this reason, we propose a non-parametric extension of the SD algorithm to solve a two-stage predictive stochastic linear programming (two-stage predictive SLP) problem stated as follows.

$$\begin{aligned} \begin{aligned} \min _{x} \ f(x, \omega ) =&\ c^\top x + {\mathbb {E}}_{{\tilde{\xi }} \sim K(\omega ,\cdot )}[h(x,{\tilde{\xi }}) \vert {\tilde{\omega }} = \omega ]\\ \text {s.t.} \ {}&x \in X \triangleq \{x \in {\mathbb {R}}^n : A x \le b\}, \end{aligned} \end{aligned}$$
(15)

where \(h(x,\xi )\) is the second-stage recourse function, whose formulation is

$$\begin{aligned} \begin{aligned} h(x,\xi ) = \min \{ d^\top y : D y = e(\xi ) - C(\xi ) x, \ y \ge 0\}. \end{aligned} \end{aligned}$$

Throughout this section, we assume the following conditions hold:

(B1):

X is a compact convex nonempty subset of \({\mathbb {R}}^n\).

(B2):

The second-stage subproblem satisfies the relatively complete recourse property.

(B3):

(Regularity of the probability space) The set of all the possible outcomes of \({\tilde{\xi }}\), \({\mathcal {Y}}\), is compact. \({\mathcal {Y}}\) is a Borel subset of a complete separable metric space and \(\Sigma _{{\mathcal {Y}}}\) is the \(\sigma \)-algebra generated by \({\mathcal {Y}}\). \({\mathcal {X}}\) is a Borel subset of a complete separable metric space and \(\Sigma _{{\mathcal {X}}}\) is the \(\sigma \)-algebra generated by \({\mathcal {X}}\). Let \({\mathcal {S}} = {\mathcal {X}} \times {\mathcal {Y}}\) and let \(\Sigma _{\mathcal {S}}\) be the \(\sigma \)-algebra generated by \({\mathcal {S}}\). \({\mathcal {S}}\) is a Borel subset of a complete separable metric space. Furthermore, the joint distribution function \(\mu _{{\tilde{\omega }}, {\tilde{\xi }}}\) exists and the conditional distribution of \({\tilde{\xi }}\) given \({\tilde{\omega }} = \omega \) is defined by (4).

(B4):

For any \(\xi \in {\mathcal {Y}}\), \({\mathbb {P}}(\Vert {\tilde{\xi }} - \xi \Vert < \delta ) > 0\) for any \(\delta > 0\).

(B5):

\(({\tilde{\omega }}_1, {\tilde{\xi }}_1), ({\tilde{\omega }}_2, {\tilde{\xi }}_2), \ldots \) are i.i.d. random variables.

(B6):

\(h(x,\xi ) \ge 0\) for all \(x \in X\) and almost every \(\xi \in {\mathcal {Y}}\). There exists \(M_h < \infty \), such that \(h(x,\xi ) \le M_h\) for all \(x \in X\) and almost every \(\xi \in {\mathcal {Y}}\).

(B7):

There exists \(M_C \in (0, \infty )\) such that \(\Vert C(\xi )\Vert \le M_C\) for almost every \(\xi \in {\mathcal {Y}}\).

(B8):

\(k_l = \lfloor l^\beta \rfloor \) and \(\beta \in (0,1)\), for \(l \in {\mathbb {N}}\).

Assumptions B1–B2 are standard in the SP literature. Assumptions B3–B5 ensure that the algorithm can sample a scenario that is arbitrarily close to one that has already been generated, which is essential for the correctness of the algorithm. Assumption B6 is necessary for the update of previous minorants in streaming data applications. Assumptions B2 and B7 together imply that the second-stage recourse function is Lipschitz continuous on X. Unlike Machine Learning (ML), where the estimates are not the outcome of constrained optimization problems, two-stage SP models have the potential to magnify the impact of uncertainty due to second-stage constraints. As a result, the boundedness assumption maintains some regularity in the objective values associated with estimated decisions. For the same reason, we also include the relatively complete recourse assumption (assumption B2), which precludes infeasibility of the second-stage problem. Assumption B8 ensures that \(k_l\) satisfies the regularity condition required for the convergence of kNN estimation.

When the sample size equals l, the \(k_l\) nearest neighbor estimate of \(f(x, \omega )\) in (15) is

$$\begin{aligned} {\hat{f}}^{k_l}_l(x, \omega ; S^l) \triangleq c^\top x + \frac{1}{k_l} \sum _{i=1}^l {\mathbb {I}}\left( \omega _i \in {\mathcal {S}}(k_l,\omega ; \{\omega _j \}_{j=1}^l) \right) h(x,\xi _i), \end{aligned}$$

where \(S^l = \{(\omega _i,\xi _i) \}_{i=1}^l\) is a dataset and \({\mathcal {S}}(k_l,\omega ; \{\omega _j \}_{j=1}^l)\) is the set of \(k_l\) nearest neighbors of \(\omega \) from the sample set \(\{\omega _j \}_{j=1}^l\). An alternative representation of \({\hat{f}}^{k_l}_l\) uses order statistics: \( {\hat{f}}^{k_l}_l(x, \omega ; S^l) = c^\top x + \frac{1}{k_l} \sum _{i=1}^{k_l} h(x,\xi ^l_{[i]})\), where \(\omega ^l_{[i]}\) in the tuple \(( \omega ^l_{[i]},\xi ^l_{[i]})\) is the \(i^{th}\) nearest neighbor of \(\omega \) from the set \(\{\omega _j \}_{j=1}^l\).

Algorithm 3: SD-kNN (pseudocode figure)

4.1 Algorithm design

In the SD-kNN algorithm, a minorant refers to an affine function of the first-stage decision, x, that is a lower bound on the kNN estimate of the objective function. One iteration of SD-kNN (Algorithm 3) consists of five components, which we illustrate by walking through the \(l^{\textrm{th}}\) iteration. The first component is Candidate Selection, which selects the next candidate solution by solving a proximal mapping problem given a piecewise linear approximation of the kNN estimate of the objective function. Problem (16) is equivalent to

$$\begin{aligned} x^l = \arg \min _{x \in X} \ {\check{f}}_{l-1}(x, \omega ) + \frac{\sigma _l}{2} \Vert x - {\hat{x}}^{l-1} \Vert ^2, \end{aligned}$$

where \({\check{f}}_{l-1}(x, \omega )\) is a piecewise linear approximation of the kNN estimate of the objective function, \({\hat{f}}^{k_{l-1}}_{l-1}(x, \omega ; S^{l-1})\), \({\hat{x}}^{l-1}\) is the incumbent solution from the last iteration, and \(x^l\) is the current candidate solution. SD-kNN then generates a new data pair, \((\omega _l,\xi _l)\), sets \(k_l = \lfloor l^\beta \rfloor \), and finds the new kNN set, \({\mathcal {S}}(k_l,\omega ; \{\omega _j \}_{j=1}^{l}) = \{ \omega ^l_{[i]} \}_{i=1}^{k_l}\). Next, SD-kNN computes a (possibly new) dual extreme point at the new candidate solution for the new data pair:

$$\begin{aligned} \pi ^l \in \arg \max \{ \pi ^\top (e(\xi _l) - C(\xi _l) x^l) \ \vert \ \pi ^\top D \le d^\top \}. \end{aligned}$$
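
Computing \(\pi ^l\) amounts to solving a small linear program; a minimal sketch using scipy (with the data \(D, d, e(\xi _l), C(\xi _l)\) passed as arrays) might look as follows.

```python
import numpy as np
from scipy.optimize import linprog

def dual_extreme_point(D, d, e_xi, C_xi, x):
    """Solve max{ pi^T (e(xi) - C(xi) x) : pi^T D <= d^T } for a dual vector (sketch)."""
    D = np.asarray(D, dtype=float)
    r = np.asarray(e_xi, dtype=float) - np.asarray(C_xi, dtype=float) @ np.asarray(x, dtype=float)
    m = D.shape[0]                                   # dimension of pi
    res = linprog(c=-r,                              # maximize r^T pi
                  A_ub=D.T, b_ub=np.asarray(d, dtype=float),
                  bounds=[(None, None)] * m,         # pi is a free variable
                  method="highs")
    return res.x
```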

By including the newly computed dual extreme point, the set of explored dual extreme points is updated; minorants that are inactive at the new candidate solution are removed (see equation (17)). The second and third components are Dual Selection and Minorant Construction, which use the set of explored dual extreme points and the kNN set to construct new minorants at the candidate and incumbent solutions. The constructed minorants are denoted \({\mathcal {C}}_l^l(x,\omega )\) and \({\mathcal {C}}_{-l}^{l}(x,\omega )\) (see Algorithm 3 for details). The fourth component is the Minorant Update, which updates the minorants from previous iterations for streaming data applications:

$$\begin{aligned} {\mathcal {C}}^{l}_i(x,\omega ) = {\left\{ \begin{array}{ll} {\mathcal {C}}^{l-1}_i(x,\omega ) - \frac{M_h}{k_{l-1}} &{} \text {if } k_l = k_{l-1}, \\ \frac{k_{l-1}}{k_{l}} {\mathcal {C}}^{l-1}_i(x,\omega ) &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$
(18)

By doing so, it ensures that \(c^\top x + {\mathcal {C}}^{l}_i(x,\omega ) \le {\hat{f}}^{k_{l}}_{l}(x, \omega ; S^{l})\), which will be proven later. SD-kNN then uses the updated minorants and the newly constructed minorants to build a new piecewise linear estimate of the objective function in (15):

$$\begin{aligned} \check{f}_l(x,\omega ) = c^\top x + \max \{ {\mathcal {C}}^l_i (x,\omega ): i \in {\mathcal {J}}_l\}, \end{aligned}$$

where \({\mathcal {J}}_l\) is an index set of the minorants used for function construction (see algorithm 3 for details). The last component is Incumbent Selection, which selects a sequence of incumbents that ensure “relative" descent:

$$\begin{aligned} {\hat{x}}^l = {\left\{ \begin{array}{ll} x^l &{} \text {if } \check{f}_l(x^l,\omega ) - \check{f}_l({\hat{x}}^{l-1}, \omega ) \le q \left[ \check{f}_{l-1}(x^l,\omega ) - \check{f}_{l-1}({\hat{x}}^{l-1}, \omega )\right] , \\ {\hat{x}}^{l-1} &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$
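
The incumbent test can be written compactly; in the sketch below, f_check_l and f_check_prev stand for \(\check{f}_l(\cdot ,\omega )\) and \(\check{f}_{l-1}(\cdot ,\omega )\), and the default value of the fraction q is purely illustrative.

```python
def select_incumbent(x_cand, x_inc, f_check_l, f_check_prev, q=0.5):
    """Incumbent Selection (sketch): accept the candidate x^l only if the new
    approximation certifies at least a q-fraction of the previously predicted descent."""
    new_gap = f_check_l(x_cand) - f_check_l(x_inc)
    old_gap = f_check_prev(x_cand) - f_check_prev(x_inc)
    return x_cand if new_gap <= q * old_gap else x_inc
```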

We summarize all the steps of SD-kNN in algorithm 3.

Since both Candidate Selection and Incumbent Selection are inspired by the regularized SD in [28], we obtain the same descent condition by solving (16).

Lemma 4.1

Given \(\sigma \ge 1\), the following holds:

$$\begin{aligned} \check{f}_l(x^{l+1},\omega ) - \check{f}_l({\hat{x}}^{l},\omega ) \le -\sigma \Vert x^{l+1} - {\hat{x}}^{l} \Vert ^2 \le -\Vert x^{l+1} - {\hat{x}}^{l} \Vert ^2 \le 0. \end{aligned}$$

Recall that in the LEONA algorithm, we generate a dataset of covariates that is independent of the past to calculate each NSQG; hence, previously calculated NSQGs are never reused. In contrast, SD-kNN uses an incremental sampling scheme in the sense that it uses all the generated data to calculate a new NSQG. Moreover, it uses both previously calculated NSQGs and the current NSQG to construct an outer approximation of the objective function. Similar to SD, SD-kNN also keeps a record of captured second-stage duals to accelerate the computation of the NSQGs.

Compared with SD, the main differences in SD-kNN lie in the minorant construction and the minorant updates. Namely, the minorant construction is achieved by the kNN estimation method, and the minorant update in (18) ensures that the updated minorants remain lower bounds of the new kNN estimate of the objective function. In the SD-kNN algorithm, we do not explicitly compute the Lagrange multiplier for every sample. Instead, we only compute the Lagrange multipliers for samples that belong to the k nearest neighbor set of \(\omega \). That is, if \(k_l = k_{l+1}\) and \(\Vert \omega ^l_{[k_l]} - \omega \Vert = \Vert \omega _{l+1} - \omega \Vert \), then \(\omega _{l+1}\) will not be included in the \(k_{l+1}\) nearest neighbor set. The correctness of the new minorant update is validated in the following proposition.
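
As illustrated in the sketch below (with cuts stored as coefficient pairs, a representation assumed here for concreteness), the minorant update (18) is a simple shift or rescaling of the stored coefficients; \(M_h\) is the bound from assumption B6.

```python
import numpy as np

def update_minorants(cuts, k_prev, k_new, M_h):
    """Minorant update (18) applied to stored cut coefficients (sketch).

    Each cut (alpha, beta) represents C(x, omega) = alpha + beta^T x.
    """
    if k_new == k_prev:
        # k unchanged: shift the intercept down by M_h / k_prev
        return [(alpha - M_h / k_prev, beta) for alpha, beta in cuts]
    # k increased: rescale the whole affine function by k_prev / k_new
    return [(alpha * k_prev / k_new, np.asarray(beta) * k_prev / k_new)
            for alpha, beta in cuts]
```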

Proposition 4.2

(Lower Bound Properties in the Minorant Update) Suppose that assumptions B1–B8 hold. Let \({\mathcal {J}}_l\) denote the set of minorants used in the \(l^{\textrm{th}}\) iteration and let \({\mathcal {C}}^l_i(x,\omega ) = \alpha ^l_i + (\beta ^l_i)^\top x \), \(i \in {\mathcal {J}}_l\), be the minorant that was first generated at iteration i and updated through iteration l. Suppose that \({\hat{f}}^{k_l}_l(x,\omega ; S^l) \ge c^\top x + {\mathcal {C}}^l_i(x,\omega )\) for all \(i \in {\mathcal {J}}_l\). Then the following hold:

a. If \(k_{l+1} = k_l\), then \({\hat{f}}^{k_{l+1}}_{l+1}(x,\omega ;S^l) \ge c^\top x + C^l_i(x,\omega ) - \frac{M_h}{k_l}\), \(i \in {\mathcal {J}}_l\).

b. If \(k_{l+1} > k_l\), then \({\hat{f}}^{k_{l+1}}_{l+1}(x,\omega ;S^{l+1}) \ge c^\top x + \frac{k_{l}}{k_{l+1}} C^l_i(x,\omega )\), \(i \in {\mathcal {J}}_l\).

Proof

See Appendix A.4. \(\square \)

Remark 1

Note that in the beginning, \(\frac{M_h}{k_{l}}\) may be so large that the updated cuts lie below zero over a large portion of \(x \in X\), making them loose. But as \(l\) increases, \(k_l\) becomes large enough that the effect of \(\frac{M_h}{k_{l}}\) is negligible.

Remark 2

There is no loss of generality in assuming that h is bounded below by 0. This follows from the fact that, in the more general case where h is bounded below by a constant \(m_h\), the update rule can be modified to the following form (a sketch of both cases is given after the list):

  (a) If \(k_{l+1} = k_l\), then \({\hat{f}}^{k_{l+1}}_{l+1}(x,\omega ;S^l) \ge c^\top x + {\mathcal {C}}^l_i(x,\omega ) - \frac{M_h - m_h}{k_l}\), \(i = 1,2,\ldots ,l\).

  (b) If \(k_{l+1} > k_l\), then \({\hat{f}}^{k_{l+1}}_{l+1}(x,\omega ;S^{l+1}) \ge c^\top x + \frac{k_{l}}{k_{l+1}} {\mathcal {C}}^l_i(x,\omega ) + \frac{k_{l+1} - k_l}{k_{l+1}} m_h\), \(i = 1,2,\ldots ,l\).
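
The following Python sketch applies these two update cases to the coefficients of a stored minorant; it is a minimal illustration, with `M_h` and `m_h` standing for the upper and lower bounds on h used above, and setting `m_h = 0` recovers the update in Proposition 4.2.

```python
import numpy as np

def update_minorant(alpha, beta, k_prev, k_new, M_h, m_h=0.0):
    """Update one stored minorant C(x) = alpha + beta @ x when the neighborhood
    size changes from k_prev to k_new (illustrative sketch of the two cases).

    alpha: intercept (float); beta: slope (numpy array or float).
    """
    if k_new == k_prev:
        # Case (a): keep the slope, shift the intercept down by (M_h - m_h)/k_prev.
        return alpha - (M_h - m_h) / k_prev, beta
    # Case (b): k_new > k_prev; rescale the cut and add the lower-bound correction.
    scale = k_prev / k_new
    return scale * alpha + (k_new - k_prev) / k_new * m_h, scale * np.asarray(beta)
```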

By the minorant construction and updates in the SD-kNN algorithm, it is easy to verify that \(\check{f}_l(\cdot ,\omega )\) is an outer approximation of \({\hat{f}}^{k_l}_l(\cdot ,\omega )\), which is summarized in the following lemma.

Lemma 4.3

Suppose that assumptions B1–B8 are satisfied. Then the following hold:

  (a) \( h_l(x,\xi ) \le h(x,\xi )\), for all \(x \in X\), \(l \in {\mathbb {N}}\).

  (b) \(c^\top x + {\mathcal {C}}^l_l(x,\omega ) \le f^{k_l}_l(x,\omega )\) and \(c^\top x + {\mathcal {C}}^l_{-l}(x,\omega ) \le f^{k_l}_{-l}(x,\omega )\), for all \(x \in X\), \(l \in {\mathbb {N}}\).

  (c) \(\check{f}_l(x,\omega ) \le {\hat{f}}^{k_l}_l(x,\omega )\), for all \(x \in X\), \(l \in {\mathbb {N}}\).

4.2 Convergence results

The theoretical analysis of the SD-kNN algorithm consists of three steps. First, we show the asymptotic convergence of the generated minorants at any accumulation point. Next, we show the subsequence convergence of the candidates and incumbents, together with the directional derivatives of the piecewise linear approximations of the objective function generated by SD-kNN. Third, we show that the sequence of incumbent solutions generated by SD-kNN converges to an optimal solution of (15) with probability one. For the remainder of this section, we let \(\{x^l\}_{l=1}^\infty \) and \(\{{\hat{x}}^l\}_{l=1}^\infty \) denote the sequences of master-problem and incumbent solutions, respectively, identified by SD-kNN.

Theorem 4.4

(Asymptotic Convergence of the Minorants) Suppose assumptions B1–B8 hold. Let \(\{(\alpha ^{l}_{l},\beta ^l_l)\}_l\) and \(\{(\alpha ^{l}_{-l},\beta ^l_{-l})\}_l\) be the sequences of coefficients calculated in the minorant construction of SD-kNN (Algorithm 3). With probability one, the following hold:

  (a) For any accumulation point, \({\hat{x}}\), of \(\{x^l\}_{l=1}^\infty \), one has

    $$\begin{aligned} \lim _{n \rightarrow \infty } \alpha ^{l_n}_{l_n} + (\beta ^{l_n}_{l_n})^\top x^{l_n} = {\mathbb {E}}_{{\tilde{\xi }} \sim K(\omega ,\cdot )}[h({\hat{x}},{\tilde{\xi }}) \vert {\tilde{\omega }} = \omega ], \end{aligned}$$

    where \(\{x^{l_n} \}_{n=1}^{\infty }\) is an infinite subsequence of \(\{x^l\}_{l=1}^\infty \) and \(\lim \limits _{n \rightarrow \infty } x^{l_n} = {\hat{x}}\).

  (b) For any accumulation point, \({\hat{x}}'\), of \(\{{\hat{x}}^l\}_{l=1}^\infty \), one has

    $$\begin{aligned} \lim _{n' \rightarrow \infty } \alpha ^{l_{n'}}_{-l_{n'}} + (\beta ^{l_{n'}}_{-l_{n'}})^\top {\hat{x}}^{l_{n'}} = {\mathbb {E}}_{{\tilde{\xi }}\sim K(\omega ,\cdot )}[h({\hat{x}}',{\tilde{\xi }}) \vert {\tilde{\omega }} = \omega ], \end{aligned}$$

    where \(\{{\hat{x}}^{l_{n'}} \}_{n'=1}^{\infty }\) is an infinite subsequence of \(\{{\hat{x}}^l\}_{l=1}^\infty \) and \(\lim \limits _{n' \rightarrow \infty } {\hat{x}}^{l_{n'}} = {\hat{x}}'\).

Proof

See Appendix A.5. \(\square \)

One direct consequence of Theorem 4.4 is that, with probability one, the objective function estimates generated by the SD-kNN algorithm converge to the true objective function value at any accumulation point of the sequence of incumbent solutions. This result is summarized in the following corollary.

Corollary 4.5

Suppose assumptions B1–B8 hold. With probability one, for any accumulation point, \({\hat{x}}\), of \(\{{\hat{x}}^l \}\), the following holds:

$$\begin{aligned} \lim _{n \rightarrow \infty } \check{f}_{l_n}({\hat{x}}^{l_n},\omega ) = \lim \limits _{n \rightarrow \infty } \check{f}_{l_n+1}({\hat{x}}^{l_n},\omega ) = f({\hat{x}},\omega ), \end{aligned}$$

where \(\{{\hat{x}}^{l_n}\}_{n=1}^\infty \) is the subsequence of \(\{{\hat{x}}^l\}_{l=1}^\infty \) such that \(\lim \limits _{n \rightarrow \infty } {\hat{x}}^{l_n} = {\hat{x}}\).

Proof

See Appendix A.6. \(\square \)

The lemma below establishes an important relationship between the directional derivative of the true objective function and the directional derivative of its piecewise linear approximation generated by SD-kNN.

Lemma 4.6

Suppose that assumptions B1–B8 hold and \(\sigma \ge 1\). For a given \(x \in X\), let \(d^l(x) = \frac{ x - x^{l+1}}{\Vert x - x^{l+1}\Vert }\). Then, with probability one, the following hold:

  (a) There exists an infinite subsequence of iterations \({\mathcal {L}} \subset {\mathbb {N}}\) such that \(\lim \limits _{l \in {\mathcal {L}}} x^{l+1} = \lim \limits _{l \in {\mathcal {L}}} {\hat{x}}^l = {\hat{x}}\), for some \({\hat{x}} \in X\).

  (b) \(\limsup \limits _{l \in {\mathcal {L}}} \check{f}'_l(x^{l+1},\omega ;d^l(x)) \le f'({\hat{x}},\omega ; \frac{x - {\hat{x}}}{\Vert x - {\hat{x}} \Vert })\).

Proof

See Appendix A.7. \(\square \)

With Lemma 4.6, we can prove the asymptotic convergence of the SD-kNN algorithm in the theorem below.

Theorem 4.7

(Asymptotic Convergence of SD-kNN) Suppose that assumptions B1–B8 hold and \(\sigma \ge 1\). Then, with probability one, the sequence \(\{{\hat{x}}^l\}_{l=1}^\infty \) converges to an optimal solution of problem (15).

Proof

We first show that SD-kNN produces an infinite subsequence of iterations \({\mathcal {L}} \subset {\mathbb {N}}\) such that \(\{x^{l+1}\}_{l \in {\mathcal {L}}}\) and \(\{{\hat{x}}^l \}_{l \in {\mathcal {L}}}\) converge to the same optimal solution of (15). This result follows from Lemma 4.6 and [28, Theorem 5].

Finally, the almost sure convergence of \(\{{\hat{x}}^l\}_{l=1}^\infty \) follows from [47, Theorem 1]. \(\square \)

Remark

Based on the extension of SD to two-stage stochastic quadratic-quadratic programming (SQQP) problems in [36], it is also quite straightforward to extend the SD-kNN algorithm to solve two-stage predictive SQQP problems.

5 Numerical experiments

Both the LEONA and SD-kNN algorithms are implemented in C++. CPLEX v12.8.0 is used as the base solver to solve the convex quadratic programming and linear programming subproblems and to obtain the associated dual multipliers. All programs are run on a MacBook Pro (2017) with a 3.1 GHz dual-core Intel Core i5.

In SD-kNN, we design a dynamic step-size rule based on [48]. Namely, if the incumbent selection test is passed (i.e., \(\check{f}_l(x^l,\omega ) - \check{f}_l({\hat{x}}^{l-1},\omega ) \le q \left[ \check{f}_{l-1}(x^l,\omega ) - \check{f}_{l-1}({\hat{x}}^{l-1},\omega )\right] \)), then we set \(\sigma _l = \max \{ \frac{1}{2}\sigma _l, {\underline{\sigma }}\}\); otherwise, we set \(\sigma _l = \min \{ 2 \sigma _l, {\overline{\sigma }}\}\), where \(1 \le {\underline{\sigma }}< {\overline{\sigma }} < \infty \). In our implementation, we let \(q = 0.5\), \({\underline{\sigma }} = 1\), and \({\overline{\sigma }} = 100\). To determine the initial incumbent solution of SD-kNN, we design a presolve process. In particular, we use a certain number of data points to compute the kNN estimate of the response, \({\bar{\xi }}\vert _{{\tilde{\omega }} = \omega }\). Let \({\bar{N}}\) denote the sample size for the presolve process and let \({\bar{k}} = \lfloor {\bar{N}}^\beta \rfloor \) denote the corresponding number of nearest neighbors; the estimate is \({\bar{\xi }}\vert _{{\tilde{\omega }} = \omega } = \frac{1}{{\bar{k}}} \sum _{i=1}^{{\bar{N}}} {\mathbb {I}}\left( \omega _i \in {\mathcal {S}}({\bar{k}},\omega ; \{\omega _j \}_{j=1}^{{\bar{N}}}) \right) \xi _i. \) The initial incumbent solution, \(x^0\), is then obtained by solving a deterministic two-stage problem with \({\bar{\xi }}\vert _{{\tilde{\omega }} = \omega }\) as the realization of the second-stage scenario (i.e., \(x^0 \in \arg \min \{c^\top x + h(x,{\bar{\xi }}\vert _{{\tilde{\omega }} = \omega }): x \in X\}\)).
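
A compact sketch of the step-size update and the presolve estimate follows; the function names are illustrative, and the deterministic solve that produces \(x^0\) is not shown.

```python
import numpy as np

def update_sigma(sigma, descent_test_passed, sigma_min=1.0, sigma_max=100.0):
    """Dynamic proximal parameter: halve sigma (down to sigma_min) when the
    incumbent selection test passes, and double it (up to sigma_max) otherwise."""
    if descent_test_passed:
        return max(0.5 * sigma, sigma_min)
    return min(2.0 * sigma, sigma_max)

def presolve_response(omega_obs, omegas, xis, beta=0.6):
    """kNN presolve estimate of the response at the observed covariate: average
    the responses whose covariates are among the k-bar nearest neighbors of
    omega_obs, with k-bar = floor(N-bar ** beta).

    omegas: (N, m1) array of covariate observations; xis: matching responses.
    """
    omegas, xis = np.asarray(omegas), np.asarray(xis)
    k = max(1, int(np.floor(len(omegas) ** beta)))
    dists = np.linalg.norm(omegas - omega_obs, axis=1)
    nearest = np.argsort(dists)[:k]          # indices of the k-bar nearest neighbors
    return xis[nearest].mean(axis=0)         # the estimate bar-xi given omega
```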

To validate the performance of the SD-kNN algorithm, we implemented a multi-cut Benders decomposition (see [4, Chapter 6.5]) that sequentially re-solves the PSP instance each time a new sample is observed, which we refer to as online multi-cut Benders decomposition (OMBD). Based on OMBD, we implemented a variation that records all previously calculated cuts, which we refer to as online Benders decomposition (OBD). The pseudocode of both benchmark algorithms is presented in Appendix B.

5.1 Two-stage predictive SLP

Here, we demonstrate the computational performance of LEONA and SD-kNN by solving two instances: a) a two-stage shipment planning problem, which first appeared in [3], and b) a two-stage multiproduct inventory problem, which is a modification of the single-period multiproduct inventory model proposed in [2]. Throughout this section, we refer to the two-stage shipment planning problem and the two-stage multiproduct inventory problem as BK19 and P-BAA99, respectively. The problem dimensions are summarized in Table 1. In particular, BK19 has 4 first-stage decision variables and 4 first-stage inequality constraints, whereas one realization of its second-stage subproblem has 52 inequality constraints (excluding non-negativity constraints) and 16 decision variables. Model P-BAA99 has 25 decision variables and 50 inequality constraints in the first stage, and one realization of its second-stage subproblem has 1,375 equality constraints and 100 decision variables. We replicate each algorithm 30 times (i.e., with 30 different random seeds) for each instance.

Table 1 Sizes of the two-stage Predictive SLP instances

For simplicity, we let LEONA-E, LEONA-Q, LEONA-G, LEONA-N, and LEONA-k denote LEONA with the Epanechnikov kernel, quartic kernel, Gaussian kernel, Naïve kernel, and kNN method, respectively. The sample size is chosen to be consistent with the number of NSQG calls in LEONA. To set up the experiment in the presence of streaming data, we let the size of the initial set of observations be 50, and one new sample is observed at a time. In LEONA-k and SD-kNN, \(\beta \) is set to 0.6. In LEONA with kernel estimation, \(\beta \) is set to \(0.5/m_1\). For BK19, C is set to 1 for LEONA with the Gaussian kernel as well as for LEONA with the other kernels in \({\mathcal {K}}^{kernel}\). For P-BAA99, C is set to 10 for LEONA with the Gaussian kernel, while C is set to 50 for LEONA with the other kernels in \({\mathcal {K}}^{kernel}\). The scale of the averaging window (m) in LEONA is set to 1. As a result, the total number of samples used by LEONA is 8,645 after 13 outer iterations.

5.1.1 Two-stage shipment planning problem

In BK19, the observed predictor is set to \(\omega = (-1.1401, 0.3406, 1.3871)\) and the estimated true cost is calculated using 100,000 samples from the true conditional distribution (see [3] for the data generation process). Accordingly, the estimated cost of a solution \({\hat{x}}\) is \(\theta _v ({\hat{x}};\omega ) = \frac{1}{{\mathcal {N}}_v} \sum _{i=1}^{{\mathcal {N}}_v} F({\hat{x}},\xi _i^\omega )\), and \(c^*(\omega ) = \min \limits _{x\in X} \frac{1}{{\mathcal {N}}_v}\sum _{i=1}^{{\mathcal {N}}_v} F(x,\xi _i^\omega )\) is the estimated “optimal” cost of the perfect-foresight problem. In particular, \(c^*(\omega ) = 772.152\).
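
As a reference for how such validated quantities can be computed, here is a minimal sketch with hypothetical names: `F` evaluates \(F(x,\xi )\) and `xi_val` holds the \({\mathcal {N}}_v\) validation scenarios drawn from the true conditional distribution.

```python
import numpy as np

def validated_cost(x_hat, F, xi_val):
    """theta_v(x_hat; omega): average cost of the estimated decision over the
    validation scenarios drawn from the true conditional distribution."""
    return float(np.mean([F(x_hat, xi) for xi in xi_val]))

def relative_gap(x_hat, F, xi_val, c_star):
    """One way to express solution quality: validated cost relative to the
    estimated optimal cost c_star(omega) of the perfect-foresight problem."""
    return (validated_cost(x_hat, F, xi_val) - c_star) / abs(c_star)
```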

Table 2 BK19 (sample size \(=\) 8,645)

Fig. 1 Computational results for BK19. a Validated solution quality of various algorithms versus sample size. b Impact of the dimension of the predictor for BK19 (sample size \(=\) 8,645)

Table 2 and Figs. 1 and 2 summarize the computational performance of LEONA and SD-kNN on the BK19 instance. In Table 2, we summarize the average performance of each algorithm when the total number of observed samples is 8,645. Overall, SD-kNN outperforms LEONA in terms of solution quality and stability while requiring more computational time. The higher time consumption of SD-kNN is due to the fact that it updates the solution for each new observation of the covariate, while LEONA updates the solution only for a certain batch of samples. On the other hand, the benchmark algorithms OMBD/OBD attain only a marginally better solution (by a fraction of a percent) than SD-kNN, at the cost of much greater computational time (i.e., 3315.64s/159.70s vs. 82.63s). For streaming data, implementing SD requires only one iteration between data updates to ensure asymptotic convergence, so the time per iteration is the relevant quantity for judging whether SD-kNN provides a realistic approach for streaming data applications. We observe that the time per iteration of SD-kNN is 0.01s, which seems reasonable for such applications. Accordingly, it is appropriate to study trade-offs between SD-kNN and the two first-order (SQG) methods for streaming data applications. Figure 1a shows that SD-kNN converges faster than LEONA in terms of the number of samples used. The smaller sample size observed for SD-kNN can likely be attributed to the piecewise linear approximations of the objective function in SD compared with the linear (first-order) approximations of LEONA.

Fig. 2 Computational results of SD-kNN on BK19. a Total number of explored dual extreme points versus number of solution updates in SD-kNN. b Number of minorants versus number of solution updates in SD-kNN. In both figures, the solid line is the average of the corresponding quantity and the shaded area shows the range between the largest and smallest values over the 30 replications across the entire computational process

In Fig. 2a, we observe that the number of unique dual vertices stored by SD-kNN stabilizes at around 105 as the number of solution updates increases. In Fig. 2b, we observe that the number of stored minorants stays between 2 and 18 across the entire computational process over all 30 replications. The average number of minorants decreases because “inactive minorants” (at a new candidate solution) can be removed (see equation (16)), as recommended by SD theory [28]. This keeps the number of stored minorants from growing beyond a finite upper bound [28].

We provide experimental evidence on the impact of the dimension of the predictor on the performance of the algorithms for BK19. In BK19, the response is a linear transformation of the predictor with some white noise. We increase the dimension of the predictor by including auxiliary random variables, which are characterized as AR(1) time series; the response still follows the same pattern as in the original BK19. For each algorithm, we evaluate the out-of-sample objective cost of the estimated decisions when the sample size equals a point from the set {50, 153, 315, 545, 855, 1260, 1778, 2430, 3240, 4235, 5445, 6903, 8645, 10710, 13140, 15980, 19278, 23085, 27455, 32445} (i.e., the set of points is based on the sample sizes used for LEONA’s outer iteration numbers ranging from 1 to 20). We evaluate sample efficiency based on the following two metrics: a) the required sample size for the average optimality gap to drop to within 5% of the estimated optimal cost; and b) computational time. These metrics are summarized in Table 3.

The first line of the table reports the performance of the benchmark, OMBD. The computational time reported in the first row demonstrates that an online version of Benders decomposition is so time-consuming as to be unacceptable for streaming data. When \(m_1 = 15\), OMBD takes more than 55 min to reach the required precision (within 5% of the optimal cost), while LEONA-E takes less than 7 s. Compared with LEONA, SD-kNN has better sample efficiency but higher computational time. Compared with OMBD, SD-kNN has similar sample efficiency but much lower computational time. Among the three algorithms compared in the study, LEONA takes the least time while requiring larger sample sizes for comparable accuracy. We observe that both LEONA-N and LEONA-k fail to provide decisions with the required precision when \(m_1 \ge 9\) and the maximum sample size is 32,445. In other words, the dimension of the predictor has a greater impact on the performance of LEONA-N and LEONA-k. Overall, we observe that LEONA-E (Epanechnikov kernel) provides the most effective combination of estimation and optimization among the various choices of kernels in our study. Moreover, the required sample size for each algorithm generally increases as the dimension of the predictor increases. For a fixed number of data points, the solutions for higher-dimensional problems will generally be of lower quality, as shown in Fig. 1b.

Table 3 Sample efficiency of algorithms in BK19

5.1.2 Two-stage multiproduct inventory problem

Table 4 P-BAA99 (sample size \(=\) 5,445)

Fig. 3 Computational results for P-BAA99. a Validated solution quality of various algorithms versus sample size. b Box plot of validated solution qualities (\(N=5445\))

In P-BAA99, the predictor and the demand (i.e., the response) jointly follow a truncated multivariate normal distribution with mean vector \(\mu \) and covariance matrix \(\Sigma \) (the simulation is based on [7]). The observed predictor is \(\omega _i = 108 - \frac{i}{3}\), \(i = 1,2,\ldots ,25\). We use kNN to estimate the true cost of a given solution. Although such an approach will inevitably incur some bias in measuring solution quality, it is available in practice and its performance is ensured by the pointwise convergence of the kNN estimator [51]. To initiate the validation for P-BAA99, we generate \(10^6\) predictor-response data pairs from the unconditional distribution and find the \(10^3\) nearest neighbors of the observed predictor. The associated estimated optimal cost (i.e., \(c^*(\omega )\)) is -14,262. The computational performance of LEONA and SD-kNN is summarized in Table 4 and Fig. 3. Figure 3a plots the average solution quality versus the sample size used. Similar to the results for BK19, SD-kNN converges the fastest in terms of the samples used. We also create box plots for each algorithm when the total sample size is 5,445 in Fig. 3b to visually showcase the distribution of the solution qualities. Based on Table 4 and Fig. 3b, we observe that LEONA-G has much more variation than the other algorithms. On the other hand, LEONA-Q attains the best average solution quality and LEONA-E attains the lowest variation when the sample size is 5,445. We also find that OMBD consumes much more time than SD-kNN (11,461.08s vs. 707.86s) to perform the solution updates without any perceptible improvement in solution quality. By recording all previously calculated cuts, OBD drastically decreases the computational time from 11,461.08s to 895.77s, but even that is considerably higher than the time required by SD-kNN (707.86s), again without improving solution quality. For real-time applications, the LEONA family of algorithms may be the most promising, because the time lag (of a few seconds) is likely easier to tolerate for streaming data applications. This encouraging observation is the first of its kind to be reported in the literature.

6 Conclusions

Implementing a streaming data algorithm requires more care in data structure design because of the dynamic nature of the information revealed during the algorithm. Depending on its calculations, each algorithm has its own way of managing streaming data and its impact on the algorithmic computations. We summarize the key features of LEONA and SD-kNN in terms of standard streaming algorithms from the literature ([19, 21]). In LEONA, we use sliding windows of growing sizes (landmark windows in [19]) to reduce the workload of solution updates and use averaging to summarize the solution updates. In SD-kNN, we utilize several compact data structures to speed up data summaries for further solution updates (synopses in [19]). In particular, we store the \(L_0\) and \(L_1\) norms of all the unique dual extreme points to accelerate the detection of new dual extreme points, and a sorted list of covariates helps accelerate the kNN estimation. To reduce redundant matrix calculations, we implement a data structure that records all the necessary matrix manipulations. Although the above expedients may seem trivial, their implementation is crucial to a successful large-scale implementation.
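
Under the stated assumptions, the two synopsis structures can be sketched as follows (illustrative names only; the bookkeeping in our C++ implementation differs in detail): a fingerprint of \((L_0, L_1)\) norms to recognize new dual extreme points, and a list of covariates kept sorted by distance to the observed predictor so that the k nearest neighbors are always a prefix.

```python
import bisect
import numpy as np

class DualPool:
    """Store unique dual extreme points; the (L0, L1)-norm fingerprint makes a
    full component-wise comparison necessary only for candidates whose
    fingerprint matches an already stored vertex."""
    def __init__(self, tol=1e-8):
        self.tol = tol
        self.buckets = {}  # (L0 count, rounded L1 norm) -> stored dual vertices

    def add(self, pi):
        pi = np.asarray(pi, dtype=float)
        key = (int(np.count_nonzero(np.abs(pi) > self.tol)),
               round(float(np.sum(np.abs(pi))), 6))
        bucket = self.buckets.setdefault(key, [])
        if any(np.allclose(pi, stored, atol=self.tol) for stored in bucket):
            return False   # duplicate: no new dual extreme point
        bucket.append(pi)
        return True        # a new dual extreme point was recorded

def insert_covariate(sorted_dists, sorted_idx, omega_new, omega_obs, new_index):
    """Keep covariate indices sorted by distance to the fixed observed predictor
    omega_obs; the k nearest neighbors are then simply sorted_idx[:k]."""
    d = float(np.linalg.norm(np.asarray(omega_new) - np.asarray(omega_obs)))
    pos = bisect.bisect_left(sorted_dists, d)
    sorted_dists.insert(pos, d)
    sorted_idx.insert(pos, new_index)
    return pos
```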

To conclude, we present an assessment of the contributions mentioned in the Introduction. a) We design the LEONA algorithm, which sequentially uses non-parametric stochastic quasi-gradients to update the estimated solutions to (2). We also design a fusion of non-parametric estimation and the SD algorithm to further improve the efficiency of the solution estimates. b) The convergence result is predicated on diminishing both the statistical estimation bias and the systematic optimality gap. c) Our results apply to kNN and a variety of kernels (Naïve, quartic, Epanechnikov, and Gaussian) under one common umbrella. d) The numerical performance of the LEONA and SD-kNN algorithms was studied using two-stage predictive stochastic linear programming problems. We observe that from the viewpoint of solution quality, SD-kNN appears to be the best choice for these datasets, whereas LEONA-E and LEONA-Q appear to have the lowest latency (t(s)).

We believe there is great potential for the practical use of PSP algorithms in streaming data applications, where the distribution adapts automatically to incoming data thanks to the non-parametric estimation methods. One possible theoretical extension is to derive the finite-sample complexity as well as the “regret” of both the LEONA and SD-kNN algorithms based on [5]. On the application side, we may apply our methodology to the economic dispatch problem [20], in which the prediction of wind power (using non-parametric estimation) can be included in the design of a policy to dispatch conventional generation units in real time. All in all, we expect this combination of estimation and optimization to be a fundamental step towards distribution-free algorithms in SP with covariates. Is this the start of a new era of “estimization” (estimation + optimization)?