1 Introduction

Non-parametric statistics [53] has broad applications in density estimation [30], regression [25], and many engineering areas such as image classification [38] and natural language processing [50]. Because they are distribution-free, non-parametric estimators combine predictive power with mild requirements on oracles for the probability distribution. Recently, the study of stochastic programming (SP) models with predictor/response data has begun to attract attention due to modern applications with evolving/streaming data [3, 10, 14, 26, 29, 31, 32, 33, 41, 42]. One streaming-data example is real-time renewable power generation (wind and solar), where operations can be guided by real-time demand data as well as real-time power production from renewable resources. Following Deng and Sen [10], we refer to such SP models as Predictive SP (PSP), which hold the promise of providing greater agility and adaptivity for SP models. While the predictive feature adds agility to decision-making, there is one particularly insidious aspect of such models: simply averaging finitely many functions with equal weights (as in Sample Average Approximation (SAA)) will, in general, produce a constant bias that may not shrink even as the sample size increases indefinitely (see Bertsimas and Kallus [3] for more details).

To set the stage for our development, we consider the PSP problem where a vector of random model parameters, \({\tilde{\xi }}\), depends on a predictor (random variable), \({{\tilde{\omega }}}\), and such predictors can be observed before the realization of \({{\tilde{\xi }}}\). For instance, \({\tilde{\omega }}\) is a random vector of temperature and humidity, while \({\tilde{\xi }}\) denotes precipitation [43].

In order to put the discussion on a firm mathematical footing, we first introduce the mathematical setting. Let \({\mathcal {X}}\) be a Borel subset of \({\mathbb {R}}^{m_1}\) and let \({\mathcal {Y}}\) be a Borel subset of \({\mathbb {R}}^{m_2}\). Let \((\Omega , \Sigma _{\Omega }, {\mathbb {P}})\) be the probability space of the correlated continuous random variables \({\tilde{\omega }}\) and \({\tilde{\xi }}\) taking values in the measurable spaces \(({\mathcal {X}}, \Sigma _{{\mathcal {X}}})\) and \(({\mathcal {Y}}, \Sigma _{{\mathcal {Y}}})\), respectively. In particular, the tuple \(({\tilde{\omega }},{\tilde{\xi }}): \Omega \mapsto {\mathcal {X}} \times {\mathcal {Y}}\) is a random variable taking values in the product space \(({\mathcal {X}} \times {\mathcal {Y}}, \Sigma _{{\mathcal {X}}} \otimes \Sigma _{\mathcal {Y}})\). While the underlying distribution of predictor-response pairs \(({\tilde{\omega }},{\tilde{\xi }})\) will not be assumed to be known, we will assume that a dataset of outcomes \(S^{{\mathcal {N}}} \triangleq \{(\omega _1,\xi _1), (\omega _2,\xi _2), \ldots , (\omega _{{\mathcal {N}}}, \xi _{\mathcal {N}})\}\) is available, and that more observations become available as \({\mathcal {N}} \rightarrow \infty \). Moreover, the regular conditional distribution of \({\tilde{\xi }}\) given \({\tilde{\omega }} = \omega \), denoted \(K(\omega ,\cdot )\), is assumed to exist (Footnote 1) and,

$$\begin{aligned} f(x, \omega ) \triangleq {\mathbb {E}}_{{\tilde{\xi }} \sim K(\omega ,\cdot )}[F(x,{\tilde{\xi }}) \vert {\tilde{\omega }} = \omega ] \triangleq \int _{{\mathcal {Y}}} F(x,\xi ) K(\omega ,d \xi ) \end{aligned}$$
(1)

is well defined and finite valued for any \(x \in X\). In ordinary non-parametric regression, we fix x and \(\omega \) and then use \(\{(F(x,\xi _1),\omega _1), \ldots , (F(x,\xi _{{\mathcal {N}}}),\omega _{{\mathcal {N}}} )\}\) from \(S^{{\mathcal {N}}}\) to construct a non-parametric regression of \({\mathbb {E}}_{{\tilde{\xi }} \sim K(\omega ,\cdot )}[F(x,{\tilde{\xi }}) \vert {\tilde{\omega }} = \omega ]\) in (1). Instead of simply estimating \(f(x,\omega )\) in (1), our goal is to provide an efficient methodology that supports simultaneous estimation and optimization in online settings where the data and decision processes evolve, so as to attain an asymptotically optimal decision for the following PSP problem.

$$\begin{aligned} x^*(\omega ) \in \arg \min _{x \in X} \ f(x, \omega ). \end{aligned}$$
(2)

Note that \(\omega \) is a fixed parameter in both (1) and (2). In the context of applications in reservoir management, x may denote the operating policy of the reservoir under the current observation of predictors such as humidity and temperature, and \(f(x,\omega )\) is the operating cost to be optimized.

In contrast to this paper, the works of Deng and Sen [10] and Kannan et al. [32] first discover a statistical model (i.e., a regression of \({\tilde{\xi }} \vert {\tilde{\omega }})\) and then incorporate that model within optimization. This is a two-step process referred to as Learning Enabled Optimization (LEO).

A somewhat different variant of the LEO setup is one where the regression is decision-dependent, and calls for a local regression approximating the relationship between \({{\tilde{\xi }}}\) and x (i.e., \({\tilde{\xi }} \vert x\)). That approach, which appears in [37], is referred to as Coupled Learning Enabled Optimization (CLEO), and is appropriate when one has the ability to generate data sets by querying a simulation.

A model which may be considered as the closest relative to (2) is studied by Bertsimas and Kallus [3]. It may be interpreted as an estimate-then-optimize strategy which uses non-parametric regression of the objective function. As observed in [46], this is also a two-step process where the non-parametric estimation of the objective function precedes the optimization process. Their study involves the asymptotic properties of the solution to the following problem:

$$\begin{aligned} {\hat{x}}^{{\mathcal {N}}}(\omega ) \in \arg \min _{x \in X} \ {\hat{f}}_{{\mathcal {N}}}(x,\omega ; S^{{\mathcal {N}}}) \triangleq \sum _{i=1}^{{\mathcal {N}}} v_{{\mathcal {N}},i}(\omega ) F(x,\xi _i). \end{aligned}$$
(3)

Here, \({\hat{f}}_{{\mathcal {N}}}(x,\omega ; S^{{\mathcal {N}}})\) denotes an approximation which is parameterized by x and \(\omega \) and depends on the dataset \( S^{{\mathcal {N}}}\). The weight functions (i.e., \(v_{{\mathcal {N}},i}(\omega )\)) suggested by [3] are calculated by some non-parametric method, such as k nearest neighbors (kNN) or kernel estimation, and depend on \(\omega \) and \(S^{\mathcal {N}}\). In other words, the data-driven collection of functions \(\{(\omega _i, F(x,\xi _i))\}_{i=1}^{{\mathcal {N}}}\) is used to construct an estimate of \(f(x,\omega )\). Thus, equation (3) should be considered as a non-parametric analog of ordinary Sample Average Approximation (SAA) used to approximate ordinary SP. In [3], the authors show that under certain assumptions (e.g., equi-continuity of the random objective function), solving (3) provides a statistically consistent estimate of an optimal solution to (2) (i.e., \(\lim \limits _{{\mathcal {N}} \rightarrow \infty } {\hat{x}}^{{\mathcal {N}}}(\omega ) = x^*(\omega ) \ w.p. 1\)). The consistency of such non-parametric estimation does not rely on specific assumptions regarding any particular distribution, and just as importantly, these approximations can be set up in a completely automated fashion, without much preprocessing. From the perspective of approximation theory this is very encouraging for estimating a decision model. In this sense, the results of [3] are analogous to the work of [34] for using ordinary SAA for ordinary SP, and both require some numerical algorithm to deliver decisions. It is important to observe that for streaming data applications, it is burdensome to recursively recompute the entire problem as new data points are observed. This handicap is corroborated by the computational results presented in this paper.
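
To make the weighted approximation in (3) concrete, the following minimal sketch (in Python, with hypothetical function names) assembles \({\hat{f}}_{{\mathcal {N}}}(x,\omega ; S^{{\mathcal {N}}})\) from a user-supplied weight rule; the kNN and kernel weight rules are reviewed in Sect. 2.

```python
def weighted_objective(x, omega, data, F, weights):
    """Non-parametric approximation (3): sum_i v_{N,i}(omega) * F(x, xi_i).

    data: observed pairs (omega_i, xi_i); F(x, xi) evaluates the cost function;
    weights(omega, [omega_1, ..., omega_N]) returns the weights v_{N,i}(omega)
    produced by a non-parametric method (e.g., kNN or kernel estimation).
    """
    v = weights(omega, [w for w, _ in data])
    return sum(v_i * F(x, xi_i) for v_i, (_, xi_i) in zip(v, data))
```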

Although non-parametric estimation methods have many applications in pattern recognition, image processing, weather forecasting and many others, their prior use in stochastic programming (decision models) has been limited to [24], where kernel-type estimators (for the unknown distribution) are used in a simple recourse model (sometimes referred to as \(L_1\) minimization in machine learning) with a fixed dataset and standard nonlinear programming techniques. In the aforementioned model, non-parametric estimation plays a static role in the sense that once the optimization model is set up, the non-parametric estimate is treated as a fixed model parameter. It is worth noting that the well-known bias of non-parametric estimation makes its algorithmic use more or less uncharted territory. Indeed, our setup, which is a simultaneous estimation and optimization approach, allows sequential updates to decisions as the data stream evolves, thus making time-constrained (or real-time) decision-making a realistic possibility. The road-blocks to solving (2) within the above context arise from three challenges:

(a) Is there an implementable algorithm which can automatically refine the non-parametric estimate as the data process evolves with increasing \({\mathcal {N}}\)?

(b) Given a dataset \(S^{ {\mathcal {N}}}\), suppose that the joint estimation-optimization step produces a decision \(x( S^{{\mathcal N}} )\). What properties should the joint estimation-optimization step impart to the sequence \(\{ x( S^{{{\mathcal {N}}}} )\}\) so that we obtain an optimal solution to (2) as \({\mathcal {N}} \rightarrow \infty \)? In other words, how should one design an optimization algorithm to ensure that the bias of non-parametric estimation does not adversely impact its asymptotic convergence?

(c) Is it possible to exploit the dynamics of the algorithm to recommend decision updates as the dataset grows? How should one balance estimation and optimization efficiency?

This paper is devoted to answering these important questions which arise from simultaneous estimation and optimization.

In this paper, we develop two algorithms for solving PSPs: one built around extensions of biased first-order methods and another based on piecewise linear value function approximations. The former may be viewed as an extension of stochastic approximation, or more accurately, of stochastic quasi-gradient methods (SQG, [15]), while the latter may be viewed as a counterpart of stochastic decomposition (SD, [28]) with non-parametric estimation. It is important to recognize that, by their very nature, non-parametric procedures typically lead to biased estimates in finite time; as a result, the combination of estimation and optimization algorithms must be designed so that it delivers an unbiased optimal decision (and value) without losing the asymptotic optimality associated with either traditional SQG or SD.

The contributions of this paper are:

1) An algorithm for solving PSP which goes “beyond using sample frequency” to solve SP problems with conditional expectation, given \({\tilde{\omega }} = \omega \). The first-order PSP algorithm that we study, referred to as LEONA (Footnote 2), allows non-parametric estimation as an integral part of a solution algorithm for PSP. The LEONA algorithm provides the possibility of rapid response to streaming data, thus extending the horizon of SP algorithms to new applications.

2) We propose a decomposition-based algorithm which combines the power of the SD algorithm with k nearest neighbors estimation (SD-kNN). To the best of our knowledge, this is the first decomposition-based algorithm that uses non-parametric estimation to attain the predictive capability required by PSP.

3) We provide a general framework to control the errors from the simultaneous estimation-optimization process. In [3], the emphasis is on the sequence of optimal solutions to each approximation, whereas our approach studies the asymptotic properties of a sequence of estimated solutions that is updated with each arrival in the streaming data process. Addressing such symbiotic structures is a major innovation of this work because it highlights the interplay of errors arising from statistical estimation and mathematical optimization. Preliminary computational evidence suggests the possibility of successful applications for rapid response in the presence of streaming data. The computational results show that both the LEONA and SD-kNN algorithms can provide near-optimal solutions for the class of problems studied in this paper. Further computational results also show that the loss of estimation quality incurred by our algorithms is minimal.

4) In our computational results, we present numerical experiments using LEONA, SD-kNN, and multi-cut Benders decomposition as a benchmark. These experiments suggest that the new algorithms are able to provide faster and more reliable decisions for streaming data, and comparisons with the standard benchmark (multi-cut Benders decomposition) reveal the potential power of simultaneous estimation and optimization. To the best of our knowledge, these experiments are the first such results in the literature.

The remainder of the paper is organized as follows. In Sect. 2, we provide a brief review on non-parametric estimation methods. In Sect. 3, we present the LEONA algorithm and its convergence results. In Sect. 4, we introduce the algorithm design of SD-kNN to solve two-stage predictive stochastic linear programming problems. Finally, we demonstrate the computational performance of both simultaneous estimation-optimization algorithms in Sect. 5.

1.1 Notations

Since we merge concepts from optimization and non-parametric estimation, we introduce notation below that serves both settings.

To be consistent with the optimization literature, we use x for the decision variable and X for the set to which x belongs (Footnote 3). Let the joint distribution of \(({\tilde{\omega }},{\tilde{\xi }})\) be \(\mu _{{\tilde{\omega }}, {\tilde{\xi }}}\). Let \(\mu _{{\tilde{\omega }}}\) and \(\mu _{{\tilde{\xi }}}\) denote the marginal distribution functions of \({\tilde{\omega }}\) and \({\tilde{\xi }}\), respectively. We let \(\xi \) and \(\omega \) denote the realizations of \({\tilde{\xi }}\) and \({\tilde{\omega }}\), respectively. Let \(g(\omega ,\xi )\) be the joint density function at \(({\tilde{\omega }},{\tilde{\xi }}) = (\omega , \xi )\), so that \(\mu _{{\tilde{\omega }}, {\tilde{\xi }}}\) has the form \(\mu _{{\tilde{\omega }}, {\tilde{\xi }}}(d\omega ,d\xi ) = g(\omega ,\xi ) d\omega d\xi , \ (\omega ,\xi ) \in {\mathcal {X}} \times {\mathcal {Y}}\). Let \(m(\omega ) = \int _{{\mathcal {Y}}} g(\omega ,\xi ) d \xi \) and suppose that the conditional distribution given \({\tilde{\omega }} = \omega \) is as follows (see [9] for details).

$$\begin{aligned} k(\xi \vert \omega ) = {\left\{ \begin{array}{ll} \frac{g( \omega , \xi )}{ m (\omega )} \quad &{}\text {if} \ m(\omega ) > 0, \\ \int _{{\mathcal {X}}} g(\omega ',\xi ) d{\omega '} \quad &{}\text {if} \ m(\omega ) = 0. \end{array}\right. } \end{aligned}$$
(4)

We use \({\mathbb {P}}\) and \({\mathbb {E}}\) to denote the probability and expectation operators, respectively. Specifically, the expectation \({\mathbb {E}}[\cdot ]\) taken in the algorithm is with respect to the Cartesian product of infinitely many probability spaces of i.i.d. samples generated by the sampling procedure. The expectation of the bias is defined with respect to the Cartesian product of finitely many probability spaces which correspond to the dataset (\({\tilde{S}}^N\)) used. We let \(N \ll {\mathcal {N}}\) and let \({\tilde{S}}^N \triangleq \{({\tilde{\omega }}_i, {\tilde{\xi }}_i) \}_{i=1}^N\) denote a set of N i.i.d. copies of \(({\tilde{\omega }},{\tilde{\xi }})\). By the convention followed in this paper, \(S^N\) denotes a realization of \({\tilde{S}}^N\). We use the Euclidean distance to measure the distance between two predictors. We let \(\Vert \cdot \Vert \) denote the Euclidean norm of a vector and the spectral norm of a matrix. Let \({\tilde{\omega }}^N_{[j]}(\omega )\) denote the \(j^{th}\) closest neighbor of \(\omega \) from a collection of random variables \(\{{\tilde{\omega }}_{i} \}_{i=1}^N\). We let \({\mathbb {I}}({\mathcal {E}})\) denote an indicator function whose value is 1 if the event \({\mathcal {E}}\) is true and 0 otherwise.

Since the true objective function in (2) is defined as a function of the decision x and the predictor \(\omega \), we define the directional derivative of this class of functions at \(x \in {\mathbb {R}}^n\) in the direction \(d \in {\mathbb {R}}^n\) for a given parameter \(\omega \in {\mathcal {X}}\).

Definition 1

Let \(\zeta (x,\omega ): S \times {\mathcal {X}} \mapsto {\mathbb {R}}\), where S is an open convex set of \({\mathbb {R}}^n\) and \({\mathcal {X}}\) is a Borel subset of \({\mathbb {R}}^{m_1 }\). For a given \(\omega \in {\mathcal {X}}\), suppose that the directional derivative of \(\zeta (\cdot ,\omega )\) at \(x \in S\) in the direction of \(d \in {\mathbb {R}}^n\) exists and it is defined below:

$$\begin{aligned} \zeta '(x,\omega ;d) = \lim _{\tau \downarrow 0} \frac{\zeta (x + \tau d,\omega ) - \zeta (x,\omega )}{\tau }. \end{aligned}$$

2 Background on non-parametric statistical estimation

Statistical regression is used to model the relationship between predictors (sometimes referred to as independent variables) and responses (sometimes referred to as dependent variables). Non-parametric estimation is a branch of statistical regression that builds such models purely from the provided sampled data, without assuming any specific pre-determined structure relating predictors and responses. One of its main advantages is that model-fitting is not required prior to the optimization process [18].

One widely used non-parametric estimation method is the k nearest neighbors (kNN) method. As the name suggests, the mean over the k nearest neighbors of the predictor (here, the word “nearest" is sensitive to the metric used to measure distance) is used as the estimate of the corresponding conditional expectation. After Fix and Hodges Jr [17] proposed the nearest-neighbor method in 1951, several articles [11, 12, 25, 51, 52] contributed to the consistency of kNN regression. Among these, Walk [51] shows that the kNN estimate converges to its true conditional expectation almost surely under certain regularity conditions. We give the definition of the kNN estimate of \(f(x,\omega )\) below.

Definition 2

Let \({\mathcal {S}}(k_N,\omega ; \{\omega _j \}_{j=1}^N)\) be the \(k_N\) nearest neighbors of \(\omega \) from the sample set \(\{\omega _j \}_{j=1}^N\). The kNN estimate of \(f(x,\omega )\) is defined as follows.

$$\begin{aligned} {\hat{f}}_{k_N,N}(x, \omega ;S^N) \triangleq \frac{1}{k_N} \sum _{i=1}^N {\mathbb {I}}\left( \omega _i \in {\mathcal {S}}(k_N,\omega ; \{\omega _j \}_{j=1}^N) \right) F(x,\xi _i). \end{aligned}$$

Note that \({\hat{f}}_{k_N,N}(x, \omega ;S^N)\) is a measurable function of \(x, \ \omega \) and the data \(S^N\) (see [25] for more details). We provide the formal statement of asymptotic convergence of the kNN estimate of the objective function in (2) below.

Theorem 2.1

([51, Theorem 1]) Let \({\tilde{S}}^N = \{({\tilde{\omega }}_i,{\tilde{\xi }}_i)\}_{i=1}^N\) be independently and identically distributed (i.i.d.) random variables. Let \(x \in X\) and suppose that \({\mathbb {E}}[\vert F(x,{\tilde{\xi }}) \vert ] < \infty \). Let \(k_N\) be monotone increasing with N, \(\lim _{N \rightarrow \infty } k_N = \infty \), \(\lim _{N \rightarrow \infty } \frac{k_N}{N} = 0\), and let \(k_N\) vary regularly with exponent \(\beta \in (0,1]\) (e.g., \(k_N = \lfloor N^{\beta } \rfloor \), \(\beta \in (0,1)\)). Ties are broken randomly or by index. Then the following holds:

$$\begin{aligned} \lim _{N \rightarrow \infty } {\hat{f}}_{k_N,N}(x,\omega ; {\tilde{S}}^N) = f(x,\omega ), \quad w.p.1. \end{aligned}$$
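
To make Definition 2 concrete, here is a minimal Python sketch of the kNN estimate; the interface (the dataset layout and the cost function F) is an assumption for illustration, and ties are broken by index via a stable sort, consistent with Theorem 2.1.

```python
import numpy as np

def knn_estimate(x, omega, data, F, beta=0.5):
    """kNN estimate of f(x, omega) from Definition 2 (illustrative sketch).

    data: list of pairs (omega_i, xi_i); F(x, xi) evaluates the cost function;
    beta: exponent for k_N = floor(N**beta), as allowed by Theorem 2.1.
    """
    N = len(data)
    k_N = max(1, int(np.floor(N ** beta)))
    # Euclidean distances from the query predictor omega to each sampled predictor
    dists = np.array([np.linalg.norm(np.asarray(w_i) - np.asarray(omega))
                      for w_i, _ in data])
    # Stable argsort breaks ties by index
    nn_idx = np.argsort(dists, kind="stable")[:k_N]
    # Average the cost over the k_N nearest responses
    return float(np.mean([F(x, data[i][1]) for i in nn_idx]))
```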

Another popular class of non-parametric regression methods falls under the category of kernel methods. Specifically, some predefined kernel function is used to estimate the marginal density function. Since Nadaraya [39] and Watson [54] proposed the famous Nadaraya-Watson kernel estimate in 1964, it has been used in biostatistics [8], economics [6], image processing [49], etc. Several papers [11, 22, 23, 25, 35, 52] further refined the theory of kernel estimation. Let \({\mathcal {K}}^{kernel}\) denote the set of the following kernel functions:

1. Naïve kernel: \(K^{{\mathcal {N}}}(z) \triangleq {\mathbb {I}}_{\{\Vert z\Vert \le 1\}}\).

2. Epanechnikov kernel: \(K^{{\mathcal {E}}}(z) \triangleq (1 - \Vert z\Vert ^2){\mathbb {I}}_{\{\Vert z\Vert \le 1\}}\).

3. Quartic kernel: \(K^{{\mathcal {Q}}}(z) \triangleq (1 - \Vert z\Vert ^2)^2{\mathbb {I}}_{\{\Vert z\Vert \le 1\}}\).

4. Gaussian kernel: \(K^{{\mathcal {G}}}(z) \triangleq e^{-\frac{\Vert z\Vert ^2}{2}}\).

Definition 3

Let \(K \in {\mathcal {K}}^{kernel}\). The kernel estimation of \(f(x,\omega )\) is defined as follows.

$$\begin{aligned} {\hat{f}}^{\textrm{kernel}}_N(x,\omega ; S^N) \triangleq \frac{\sum _{i=1}^N F(x,\xi _i) K(\frac{\omega - \omega _i}{h_N})}{\sum _{i=1}^N K(\frac{\omega - \omega _i}{h_N})}, \end{aligned}$$
(5)

where \(h_N > 0\) is the bandwidth. Also note that \({\hat{f}}^{\textrm{kernel}}_N(x,\omega ; S^N)\) is a measurable function of x, \(\omega \), and the data \(S^N\). The strong pointwise consistency property of the Nadaraya-Watson kernel regression estimate is summarized below.

Theorem 2.2

([52, Theorem 3]) Let \(\{({\tilde{\omega }}_i,{\tilde{\xi }}_i)\}_{i=1}^N\) be identically and independently distributed (i.i.d.) random variables. Let the bandwidth be denoted \(h_N > 0\) and \(h_N= C N^{-\beta }, \ {C > 0}, \ \beta \in (0,\frac{1}{m_1})\) (i.e., \(m_1\) is the dimension of \(\omega \)), for all \(N \in {\mathbb {N}}\). Let \(K \in {\mathcal {K}}^{\textrm{kernel}}\). Let \(x \in X\) and suppose that \({\mathbb {E}}\left[ \vert F(x,{\tilde{\xi }}) \vert \max \{\log \vert F(x,{\tilde{\xi }})\vert ,0 \} \right] < \infty \). If the kernel is used to calculate \(v_{N,i}(\omega )\), then the following holds:

$$\begin{aligned} \lim _{N \rightarrow \infty } {\hat{f}}^{\textrm{kernel}}_{N}(x,\omega ; {\tilde{S}}^N) = f(x,\omega ), \quad w.p.1. \end{aligned}$$
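
As a companion to Definition 3 and Theorem 2.2, the sketch below (with assumed inputs) computes the Nadaraya-Watson estimate with the Gaussian kernel and the bandwidth rule \(h_N = C N^{-\beta }\); it also adopts the \(0/0 = 0\) convention used later in Sect. 3.5.

```python
import numpy as np

def nw_kernel_estimate(x, omega, data, F, C=1.0, beta=0.2,
                       kernel=lambda z: np.exp(-0.5 * z ** 2)):
    """Nadaraya-Watson estimate of f(x, omega) from Definition 3 (sketch).

    Uses the Gaussian kernel by default and bandwidth h_N = C * N**(-beta),
    with beta in (0, 1/m_1) as required by Theorem 2.2.
    """
    N = len(data)
    h_N = C * N ** (-beta)
    # Kernel weights K((omega - omega_i) / h_N), evaluated through the Euclidean norm
    weights = np.array([kernel(np.linalg.norm(np.asarray(w_i) - np.asarray(omega)) / h_N)
                        for w_i, _ in data])
    total = weights.sum()
    if total == 0.0:               # 0/0 := 0 convention for compactly supported kernels
        return 0.0
    values = np.array([F(x, xi_i) for _, xi_i in data])
    return float(weights @ values / total)
```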

Remark

In [25, Theorem 6.2], Györfi et al. study the convergence rate of kNN estimation (i.e., of \({\mathbb {E}}_{{\tilde{S}}^N}[\int _{{\mathcal {X}}} ( m^{k_N}_N(\omega ;{\tilde{S}}^N) - m(\omega ) )^2 \mu _\omega (d\omega )]\)), where \(m^{k_N}_N(\omega ;{\tilde{S}}^N)\) is the kNN estimate of \(m(\omega )\). The same comment applies to the kernel estimate (see [25, Theorem 5.2]). We anticipate that one may be able to obtain refined convergence rates (see [5] for details) by studying the difference \(\sup \limits _{x \in X} \vert {\hat{f}}_{k_N,N}(x,\omega ; {\tilde{S}}^N) - f(x,\omega ) \vert \), but this is beyond the scope of the current paper.

3 Learning enabled optimization with non-parametric approximations

Hannah et al. [26] include non-parametric estimation in the context of function-value and gradient-value estimation. However, their focus is on an empirical investigation of non-parametric estimation rather than on a statistical and mathematical framework supporting asymptotic results. In the context of non-parametric sample average approximation, the asymptotic results of Bertsimas and Kallus [3] resolved the open question of statistical consistency of the model. Our paper goes a step further than [3] by incorporating simultaneous estimation and optimization in the presence of streaming data.

3.1 Mathematical formulation

In this subsection, we formally set up the mathematical foundations for LEONA. We let \(G(x,\xi )\) denote a subgradient of \(F(x,\xi )\). Given \(\omega \) as an observation of \({\tilde{\omega }}\), we introduce the following assumptions to ensure the asymptotic properties of the LEONA algorithm.

(A0):

(Non-negativity of the weight function) Let \(v_{N,i}: {\mathbb {R}}^{m_1} \mapsto {\mathbb {R}}\). The weight function is non-negative for all i (i.e., \(v_{N,i}(\omega ) \ge 0\)) and \(\sum _{i=1}^N v_{N,i}(\omega ) \le 1\).

(A1):

(Probabilistic existence of \({\tilde{\omega }} = \omega \)) The joint distribution function \(\mu _{{\tilde{\omega }}, {\tilde{\xi }}}\) exists and the conditional distribution is defined by (4).

(A2):

(i.i.d assumption on data pairs) \(({\tilde{\omega }}_1,{\tilde{\xi }}_1), ({\tilde{\omega }}_2,{\tilde{\xi }}_2), \ldots \) are independent and identically distributed random variables.

(A3):

(Compact and convex feasible region) \(X \subset {\mathbb {R}}^n\) is a deterministic compact convex nonempty set.

(A4):

(Convex objective function and related boundedness properties) Let \(F:X \times {\mathcal {Y}} \mapsto {\mathbb {R}}\). For every \(\xi \in {\mathcal {Y}} \subset {\mathbb {R}}^{m_2}\), \(F(\cdot ,\xi )\) is convex on X, and its subdifferential is bounded on X (i.e., there exists \(M_G \in (0, \infty )\) such that \(\Vert G(x,\xi )\Vert \le M_G\)). Additionally, there exists \(M_F \in (0, \infty )\) such that \(\vert F(x,\xi ) \vert \le M_F\) for all \(x \in X\) and \(\xi \in {\mathcal {Y}}\).

(A5):

(Lipschitz continuity of the cost function) For any \(\xi \in {\mathcal {Y}}\), \(F(\cdot ,\xi )\) is Lipschitz continuous with a uniform constant \(L_F\) on X (i.e., \(\vert F(x,\xi ) - F(x',\xi ) \vert \le L_F \Vert x - x' \Vert \) for all \(x, x' \in X\)).

(A6):

Suppose that there exists \(\delta _N(\cdot ,\cdot ) \ge 0\) such that \(\vert {\mathbb {E}}[{\hat{f}}_{N}(x,\omega ; {\tilde{S}}^N)] - f(x,\omega ) \vert \le \delta _{N}(x,\omega )\) for all \(x \in X\) and \(\delta _N(\cdot ,\omega )\) converges uniformly to 0 on X, as \(N \rightarrow \infty \).

Note that assumption A0 automatically holds for both kNN and kernel estimation. Assumption A6 requires that the bias from the non-parametric estimation converge to 0 uniformly on the feasible region, which will be shown to hold for both kNN and kernel estimation under assumptions A1–A5. Note that assumption A6 does not impose a finite-sample bound on the estimation error.

We begin by stating a result from [13], which compares the conditional expectations of two ordered random variables.

Lemma 3.1

Let \((\Omega , \Sigma _\Omega , {\mathbb {P}})\) be a probability space. Let \(\Sigma _0 \subset \Sigma _\Omega \) be a sub-\(\sigma \)-algebra and let \(Y_1,Y_2\) be \(\Sigma _\Omega \)-measurable random variables. Suppose that \({\mathbb {E}}[\vert Y_1 \vert ] < \infty \) and \({\mathbb {E}}[\vert Y_2 \vert ] < \infty \). Then \(Y_1 \le Y_2\) implies \({\mathbb {E}}[Y_1 \vert \Sigma _0] \le {\mathbb {E}}[Y_2 \vert \Sigma _0] \ \text {w.p.1}\).

Next, Lemma 3.2 states that the true objective function in (2) and its non-parametric estimate in (3) are convex under the assumptions given above.

Lemma 3.2

(Convexity of objective functions) Suppose assumptions A0–A4 are satisfied. Then the following hold:

(1) \(f(\cdot ,\omega )\) is convex on X, for \(\mu _{{\tilde{\omega }}}\) almost every \(\omega \in {\mathcal {X}}\).

(2) \({\hat{f}}_N(\cdot ,\omega ;S^N)\) is convex on X.

Lemmas 3.1 and 3.2 establish the convex structure of the model, which, at a conceptual level, makes it amenable to large-scale optimization. One consequence of convexity is that at points where the value is finite, the standard subdifferential of convex analysis exists. Ermoliev and Gaivoronski [15, 16] define a class of stochastic subgradients with a particular bias property for classical SP (i.e., without covariates).

Definition 4

[15] Let \(\hat{{\mathcal {G}}}_N(x; \{\xi _i\}_{i=1}^N)\) be a measurable function of x and \(\{\xi _i \}_{i=1}^N\). Furthermore, let \(x^* \in \arg \min \limits _{x \in X} f(x) \triangleq {\mathbb {E}}_{{\tilde{\xi }}}[F(x,{\tilde{\xi }})]\). Then \(\hat{{\mathcal {G}}}_N(x;\{{\tilde{\xi }}_i\}_{i=1}^N)\) is a stochastic quasi-gradient of f(x) if it satisfies the following condition:

$$\begin{aligned} f(x^*) - f(x) \ge (x^* - x)^\top {\mathbb {E}}[\hat{{\mathcal {G}}}_N(x;\{{\tilde{\xi }}_i\}_{i=1}^N)] + \tau _N, \ \text {where} \ \lim _{N \rightarrow \infty } \tau _N = 0. \end{aligned}$$
(6)

We extend the notion of stochastic quasi-gradient to the case of SP with covariates. As a result, our approach opens a whole new area of applying non-parametric estimation for PSP.

Definition 5

Let \(\hat{{\mathcal {G}}}_N(x,\omega ; S^N)\) be a measurable function of x, \(\omega \), and \(S^N\). Furthermore, let \(x^*\) be an optimal solution of (2). Then \(\hat{{\mathcal {G}}}_N(x,\omega ;{\tilde{S}}^N)\) is a stochastic quasi-gradient of \(f(x,\omega )\) if it satisfies the following condition:

$$\begin{aligned} f(x^*,\omega ) - f(x,\omega ) \ge (x^* - x)^\top {\mathbb {E}}[ \hat{{\mathcal {G}}}_N(x,\omega ;{\tilde{S}}^N)] + \tau _N, \ \text {with} \ \lim _{N \rightarrow \infty } \tau _N = 0. \end{aligned}$$
(7)

Unlike Monte Carlo estimates of subgradients, estimates based on stochastic quasi-gradients are biased; moreover, these estimates are not \(\epsilon \)-subgradients in the standard sense because the errors are not required to be positive. Recall that we let \(G(x,\xi )\) denote a subgradient of \(F(x,\xi )\). Using the scaling and addition properties of subdifferentials, we define the Non-parametric Stochastic Quasi-gradient (NSQG) below.

Definition 6

Let \({\hat{G}}_N(x,\omega ; S^N)\) be a measurable function of x, \(\omega \), and data \(S^N\). \({\hat{G}}_N(x,\omega ; S^N)\) is a Non-parametric Stochastic Quasi-gradient (NSQG) of \(f(x,\omega )\) in (2), if it can be written as the weighted sum below,

$$\begin{aligned} {\hat{G}}_N(x,\omega ; S^N) \triangleq \sum _{i=1}^N v_{N,i}(\omega ) G(x,\xi _i), \end{aligned}$$
(8)

where \(G(x,\xi _i)\) is the subgradient of \(F(x,\xi _i)\) and the weight \(v_{N,i}(\omega )\) is calculated via non-parametric estimation methods.

The notion of NSQG in Definition 6 is the key component of the LEONA algorithm and distinguishes it from other methods. According to Altman [1], non-parametric estimators (e.g., kNN and kernel estimators) are often biased. However, under mild assumptions, the following analysis shows that the bias of both the kNN and kernel estimators of the true subgradients decreases to zero as the sample size increases to infinity. For now, we focus on the general case in which the estimation bias converges to 0 uniformly on X and show that this condition guarantees that \({\hat{G}}_N(x,\omega ; {\tilde{S}}^N)\) is a stochastic quasi-gradient of \(f(x,\omega )\).
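
As an illustration of Definition 6, a minimal sketch (with hypothetical interfaces for G and the weight rule) of assembling an NSQG from per-scenario subgradients is given below.

```python
import numpy as np

def nsqg(x, omega, data, G, weights):
    """NSQG (8): hat G_N(x, omega; S^N) = sum_i v_{N,i}(omega) * G(x, xi_i) (sketch).

    G(x, xi) returns a subgradient of F(., xi) at x; weights(omega, predictors)
    returns the non-parametric weights v_{N,i}(omega) (kNN- or kernel-based).
    """
    v = weights(omega, [w for w, _ in data])
    return np.sum([v_i * np.asarray(G(x, xi_i), dtype=float)
                   for v_i, (_, xi_i) in zip(v, data)], axis=0)
```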

Lemma 3.3

Let \({\hat{f}}_N(x,\omega ;{\tilde{S}}^N)\) be as defined in (3). Suppose assumptions A0–A4 and A6 are satisfied. Then \({\hat{G}}_N(x,\omega ;{\tilde{S}}^N)\) is a stochastic quasi-gradient of \(f(x,\omega )\) in the sense of Definition 5.

Proof

See Appendix A.1. \(\square \)

3.2 Algorithm design

Now we introduce a subroutine of the LEONA algorithm, which we refer to as the NSQG Update. In formulating the joint estimation-optimization process, we adopt a k nearest neighbors (kNN)/kernel approach for subgradient estimation and a stochastic quasi-gradient method for optimization ([16]). This process can be understood as a simultaneous learning approach that uses statistical learning to improve decision-making within a first-order algorithm. Given the current estimated decision \(x_l\), step size \(\gamma _l\), and sample size \(N_l\), the key feature of the NSQG Update is that it generates an NSQG at \(x_l\), \({\hat{G}}_{N_l}(x_l,\omega ;S^{N_l}_l)\), to update the decision. The associated update formula is given as follows:

$$\begin{aligned} x_{l+1} = \Pi _X (x_l - \gamma _l {\hat{G}}_{N_l}(x_l,\omega ;S^{N_l}_l)) \end{aligned}$$
(9)

which is equivalent to solving a proximal mapping problem:

$$\begin{aligned} x_{l+1} = \arg \min _{x \in X} \{x^\top {\hat{G}}_{N_l}(x_l,\omega ;S^{N_l}_l) + \frac{1}{2 \gamma _l} \Vert x - x_l\Vert ^2 \} \end{aligned}$$
(10)
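
A minimal sketch of one NSQG Update is given below; the projection oracle project_X is an assumption (e.g., clipping for box constraints), and nsqg refers to the sketch given after Definition 6.

```python
import numpy as np

def nsqg_update(x_l, gamma_l, omega, data_l, G, weights, project_X):
    """One NSQG Update (9): x_{l+1} = Pi_X(x_l - gamma_l * hat G) (sketch)."""
    G_hat = nsqg(x_l, omega, data_l, G, weights)      # NSQG at the current iterate
    return project_X(np.asarray(x_l, dtype=float) - gamma_l * G_hat)
```
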
Algorithm 1: NSQG Update (pseudocode figure)

Algorithm 2: LEONA (pseudocode figure)

Based on the data-generating process in the NSQG Update, the dataset \({\tilde{S}}^{N_l}_l\) is independent of \({\tilde{S}}^{N_j}_j\) for any \(j \ne l\).

Inspired by Polyak’s averaging approach in stochastic approximation [40, 44], we use the proposed NSQG Update (Algorithm 1) to develop a first-order method with guaranteed asymptotic convergence, which we refer to as the LEONA algorithm. It is worth noting that since the response is assumed to be parametrized by the predictor, the convergence of the proposed algorithm holds for every possible realization of the predictor. Let \(x_0 \in X\) denote the initial estimated solution. Let n, m be pre-specified positive integers (e.g., \(n = 1, m = 2\)). Here, \(n-1\) NSQG calls are used for warm starting, generating \(x_1,\ldots , x_n\); m is the increment of the moving-window size. In other words, the parameter n determines the number of NSQG updates before the moving average is initialized, while the parameter m determines how fast the size of the moving window increases. We further let \(i_0 = n\) and \(j_0 = n - 1\), which will be used to determine which portion of the iterates enters the decision averaging. At the \(q^{\textrm{th}}\) outer iteration, the algorithm computes a constant step size,

$$\begin{aligned} {\tilde{\gamma }}_q = \frac{2D_X}{M_G \sqrt{mq}}, \end{aligned}$$

where \(D_X = \max \limits _{x \in X} \Vert x - x_0\Vert \) and \(M_G\) is the upper bound on the subgradient (see assumption A4). The window size (i.e., the number of NSQG Update calls) of the current inner loop is mq, which increases as q increases. It then calculates a starting index, \(i_q = i_{q-1} + m(q - 1)\), and an ending index, \(j_q = j_{q - 1} + mq\), which are consistent with the indices of the previously generated iterates. Starting from \(x_{i_q - 1}\), we call the NSQG Update mq times to generate \(x_{i_q}, x_{i_q + 1}, \ldots , x_{j_q}\). Then we take the average of \(\{x_{i_q}, x_{i_q + 1}, \ldots , x_{j_q}\}\) to obtain the estimated decision in the \(q^{th}\) outer loop:

$$\begin{aligned} {\tilde{x}}_q = \frac{1}{mq}\sum _{l=i_q}^{j_q} x_l. \end{aligned}$$

The entire process of the LEONA algorithm is summarized in Algorithm 2. When \(M_G\) and \(D_X\) are unknown, one can choose a step size \({\tilde{\gamma }}_q = \frac{C}{\sqrt{mq}}\), where C is a finite positive constant (i.e., \(C \in (0, \infty )\)). In other words, as long as we choose a constant step size (within each moving window) that is of the order of the inverse of the square root of the moving-window size (i.e., \(O(\frac{1}{\sqrt{mq}})\)), the asymptotic convergence of the LEONA algorithm is assured. This case is discussed in Sect. 3.3.
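
The bookkeeping of the indices \(i_q, j_q\) and of the moving-window average is illustrated in the sketch below; the sampling and update interfaces (sample_batch, update_fn) are hypothetical, and the fallback step size \(\frac{C}{\sqrt{mq}}\) is used.

```python
import numpy as np

def leona(x0, sample_batch, update_fn, n=1, m=2, C=1.0, num_outer=50):
    """Sketch of the LEONA outer loop (Algorithm 2) with moving-window averaging.

    sample_batch(N) is assumed to draw a fresh i.i.d. dataset of size N;
    update_fn(x, gamma, data) performs one NSQG Update (9) with omega, G,
    the weight rule, and the projection oracle already bound.
    """
    iterates = [np.asarray(x0, dtype=float)]          # iterates[l] stores x_l
    for l in range(1, n):                             # n - 1 warm-start NSQG calls
        iterates.append(update_fn(iterates[-1], C, sample_batch(l)))
    i_q, j_q = n, n - 1                               # i_0 = n, j_0 = n - 1
    x_tilde = iterates[-1]
    for q in range(1, num_outer + 1):
        gamma_q = C / np.sqrt(m * q)                  # constant step within the window
        i_q, j_q = i_q + m * (q - 1), j_q + m * q
        for _ in range(m * q):                        # mq NSQG Updates in this window
            data = sample_batch(len(iterates))        # sample size grows with the iteration count
            iterates.append(update_fn(iterates[-1], gamma_q, data))
        x_tilde = np.mean(iterates[i_q:j_q + 1], axis=0)   # tilde x_q = window average
    return x_tilde
```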

Remark

The motivation for using the moving averaging window is as follows. By increasing the size of the averaging window (i.e., increasing q in mq), we generate a sequence of averaged estimated solutions \(\{{\tilde{x}}_q\}\), which is shown to have an optimality gap that converges to 0.

3.3 Convergence analysis

There are two meta-schemes in the LEONA algorithm: one is function estimation, and the other is solution approximation. We do not solve the non-parametric SAA to obtain the estimated solution. Instead, we use a first-order approximation of the objective function in the non-parametric SAA to improve the estimated solution. This process evolves as new samples are encountered. As a result, both the sequence of approximations and the sequence of estimated solutions (based on such approximations) need to be shown to converge. Through the end of this section, we let l denote the \(l^{th}\) iteration of the NSQG Update, and let \(x_l\) and \(x_{l+1}\) denote two successive iterates generated by the NSQG Update. Let \(\{{\tilde{x}}_q\}_q\) denote the sequence of decisions generated by LEONA, where q denotes the outer iteration number.

In this subsection, we first develop a contraction property of the NSQG Update and then prove the convergence of the LEONA algorithm in expectation.

Lemma 3.4

(Contraction in NSQG Update) Suppose that assumptions A0–A4 and A6 are satisfied. Then, for any \(x' \in X\) the following inequality holds for all l:

$$\begin{aligned} \begin{aligned} {\mathbb {E}}[\Vert x_{l+1} - x'\Vert ^2] \le \ {}&{\mathbb {E}}[\Vert x_l - x'\Vert ^2] - 2 \gamma _l {\mathbb {E}}[f(x_l,\omega ) - f(x',\omega )] + \gamma _l^2 M_G^2\\&+ 2 \gamma _l {\mathbb {E}}[\delta _{N_l}(x_l, \omega ) + \delta _{N_l}(x',\omega )] \end{aligned}. \end{aligned}$$

Proof

See Appendix A.2. \(\square \)

There are two types of estimated-solution updates in the LEONA algorithm. Notwithstanding the bias associated with stochastic quasi-gradients, LEONA assures asymptotic convergence by extending the averaging process to an ever-increasing window size. In the following, we let \(x^* \in \arg \min \limits _{x \in X} f(x,\omega )\) be as specified in (2) and \(c^*(\omega ) = \min \limits _{x \in X} f(x,\omega )\).

Lemma 3.5

Let \(\omega \) be fixed. Suppose assumptions A0–A4 and A6 are satisfied. Let \(A_l = \sup \{\delta _{N_l}(x,\omega ): x \in X \}\). Then the following holds:

$$\begin{aligned} {\mathbb {E}}[f({\tilde{x}}_q,\omega ) - c^*(\omega )] \le \frac{2}{mq} \sum _{l=i_q}^{j_q} A_l + \frac{2 D_X M_G }{\sqrt{mq}}, \end{aligned}$$

where both \(i_q\) and \(j_q\) are defined in Algorithm 2.

Proof

See Appendix A.3. \(\square \)

The systematic error of the estimated solution in the \(q^{th}\) outer loop of the algorithm consists of the error resulting from the step size, which is \(\frac{2 D_X M_G }{\sqrt{mq}}\), and the bias from non-parametric estimation, which is \(\frac{2}{mq} \sum _{l=i_q}^{j_q} A_l\). We now state the convergence of the LEONA algorithm in the following theorem.

Theorem 3.6

(Asymptotic convergence of LEONA) Let \(\omega \) be fixed. Suppose assumptions A0–A4 and A6 are satisfied. Then the following holds:

$$\begin{aligned} \lim _{q \rightarrow \infty } {\mathbb {E}}[f({\tilde{x}}_q,\omega ) - c^*(\omega )] = 0. \end{aligned}$$

Proof

By Lemma 3.5, it suffices to show that both \(\frac{2}{mq} \sum _{l=i_q}^{j_q} A_l\) and \(\frac{2 D_X M_G }{\sqrt{mq}}\) converge to 0 as q goes to infinity. It is obvious that \( \lim \limits _{q \rightarrow \infty } \frac{2D_X M_G }{\sqrt{mq}} = 0\). By assumption A6, we have \( \lim \limits _{l \rightarrow \infty } A_l = 0\). We let \(A_{l_q} = \max _{i_q \le l \le j_q} \{ A_l\}\), and hence \(0 \le \frac{2}{mq} \sum _{l=i_q}^{j_q} A_l \le 2 A_{l_q}\). Since \(\{ A_{l_q} \}\) is a subsequence of \(\{ A_l \}\) and \(\lim \limits _{l \rightarrow \infty } A_l = 0\), it follows that \(\lim \limits _{q \rightarrow \infty } A_{l_q} = 0\). Hence, \(\lim \limits _{q \rightarrow \infty } \frac{2}{mq}\sum _{l=i_q}^{j_q} A_l = 0\). Therefore, we have \( \lim \limits _{q \rightarrow \infty } {\mathbb {E}}[f({\tilde{x}}_q,\omega ) - c^*(\omega )] = 0. \) \(\square \)

Remark

The analysis within each moving window has the same flavor as that in [40]. However, the design of the moving window is the key to ensuring that the non-parametric estimation bias diminishes to 0. In the case when \(M_G\) and \(D_X\) are unknown, we can pick the step size \(\frac{C}{\sqrt{mq}}\), where C is a positive constant (i.e., \(C \in (0, \infty )\)). In this case, we have \( {\mathbb {E}}[f({\tilde{x}}_q,\omega ) - c^*(\omega )] \le \frac{2}{mq} \sum _{l=i_q}^{j_q} A_l + \frac{2 D^2_X + \frac{1}{2} C^2 M_G^2 }{C \sqrt{mq}}. \) Again, since \(D_X\), \(M_G\), m and C are finite constants, we have

$$\begin{aligned} \lim _{q \rightarrow \infty } \frac{2 D^2_X + \frac{1}{2} C^2 M_G^2 }{C \sqrt{mq}} = \lim _{q \rightarrow \infty } q^{-\frac{1}{2}} \left[ \frac{2 D^2_X + \frac{1}{2} C^2 M_G^2 }{C \sqrt{m}} \right] = 0. \end{aligned}$$

This implies that \(\lim \limits _{q \rightarrow \infty } {\mathbb {E}}[f({\tilde{x}}_q,\omega ) - c^*(\omega )] = 0\). Although this step size rule is not optimal, the asymptotic convergence of LEONA is still valid.

3.4 LEONA with kNN estimation

We begin by providing the kNN realization of NSQG and then analyze the impact of the bias on the kNN estimate. We will show that LEONA with kNN estimation (LEONA-kNN) satisfies the conditions in Theorem 3.6.

Given a positive integer \(k_N\), a dataset \(S^N\), and an observation of the predictor \(\omega \), we aim to find the \(k_N\) covariates from this dataset closest (in Euclidean distance) to \(\omega \). We require that ties are broken by index (see the discussion of tie-breaking by Walk [51]) and that \(k_N\) is smaller than N. If \(\omega _i\) belongs to the set of \(k_N\) nearest neighbors of \(\omega \), then \(v_{N,i}(\omega )\) is set to \(\frac{1}{k_N}\); otherwise, \(v_{N,i}(\omega )\) is set to 0. In other words, \(v_{N,i}(\omega ) = \frac{1}{k_N}{\mathbb {I}}\left( \omega _i \in {\mathcal {S}}(k_N,\omega ; \{\omega _j \}_{j=1}^N) \right) \) for \(i = 1,2,\ldots ,N\).

Corollary 3.7

(Asymptotic convergence of LEONA-kNN) Let \(\omega \) be fixed. Let \(v_{N,i}(\omega ) = \frac{1}{k_N}{\mathbb {I}}(\omega _i \in {\mathcal {S}}(k_N,\omega ;\{\omega _j\}_{j=1}^N))\) and \(k_N = \lfloor N^\beta \rfloor \), \(\beta \in (0,1)\). Let \(\{{\tilde{x}}_q\}_q\) be generated by LEONA-kNN. Suppose assumptions A1–A5 are satisfied. Then \(\lim \limits _{q \rightarrow \infty } {\mathbb {E}}[f({\tilde{x}}_q, \omega ) - c^*(\omega )] = 0\).

Proof

Note that assumption A0 is automatically satisfied by the weight-function calculation of kNN estimation. By assumptions A4–A5, it is easy to verify that \(\{{\mathbb {E}}[{\hat{f}}_{k_N,N}(\cdot ,\omega ;{\tilde{S}}^N)] \}_{N}\) is Lipschitz continuous with a common Lipschitz constant, \(L_F\), which further implies equicontinuity. For any \(x \in X\), by Theorem 2.1, we have \( \lim \limits _{N \rightarrow \infty } {\hat{f}}_{k_N,N}(x,\omega ;{\tilde{S}}^N) = f(x,\omega ) \ w.p. 1.\) Then, by the Bounded Convergence Theorem, we have

$$\begin{aligned} \lim _{N \rightarrow \infty } {\mathbb {E}}[{\hat{f}}_{k_N,N}(x,\omega ;{\tilde{S}}^N)] = f(x,\omega ). \end{aligned}$$
(11)

Combining (11) with the equicontinuity of \(\{{\mathbb {E}}[{\hat{f}}_{k_N,N}(\cdot ,\omega ;{\tilde{S}}^N)] \}_{N}\), by [45, Theorem 7.25] we conclude that \( \lim \limits _{N \rightarrow \infty } \vert {\mathbb {E}}[{\hat{f}}_{k_N,N}(x,\omega ;{\tilde{S}}^N)] - f(x, \omega ) \vert = 0 \ \text {uniformly on } X. \) Finally, by Theorem 3.6, we get the desired result. \(\square \)

3.5 LEONA with kernel estimation

In this subsection, we introduce another variant of the LEONA algorithm, which utilizes the Nadaraya-Watson kernel estimator to estimate the subgradient of \(f(x,\omega )\). Recall that the kernel function is denoted K; in the literature on kernel methods, \(h_N\) is known as the bandwidth. We refer to this variant as LEONA-kernel. To ensure pointwise convergence of the kernel estimate, Theorem 2.2 suggests that \(h_N\) needs to satisfy \(h_N= C N^{-\beta }, \ \beta \in (0,\frac{1}{m_1})\) (recall that \(m_1\) is the dimension of the predictor \({\tilde{\omega }}\)). The non-parametric stochastic quasi-gradient of \(f(x,\omega )\) based on the Nadaraya-Watson kernel estimator, which we refer to as the kernel estimate of NSQG, is written as follows.

$$\begin{aligned} {\hat{G}}^{\textrm{kernel}}_{N}(x,\omega ;S^N) \triangleq \frac{\sum _{i=1}^{N} G(x,\xi _i) K(\frac{\omega - \omega _i}{h_N})}{\sum _{i=1}^{N} K(\frac{\omega - \omega _i}{h_N})}. \end{aligned}$$
(12)

The benchmark function of the kernel estimate of NSQG is provided below.

$$\begin{aligned} {\hat{f}}^{\textrm{kernel}}_N(x,\omega ;S^N) \triangleq \frac{\sum _{i=1}^N F(x,\xi _i) K(\frac{\omega - \omega _i}{h_N})}{\sum _{i=1}^N K(\frac{\omega - \omega _i}{h_N})}. \end{aligned}$$
(13)

In the analysis of kernel estimates, we define \(\frac{0}{0} = 0\) (see [52]), because it is possible that \(\sum _{i=1}^N K(\frac{\omega - \omega _i}{h_N}) = 0\) when a kernel with compact support (e.g., the Naïve kernel) is used. With this convention, for any \(K \in {\mathcal {K}}^{kernel}\), we have

$$\begin{aligned} \frac{K(\frac{\omega - \omega _i}{h_N})}{\sum _{j=1}^N K(\frac{\omega - \omega _j}{h_N})} \ge 0 \ \forall \ i, \ \text {and} \ \sum _{i=1}^N \frac{K(\frac{\omega - \omega _i}{h_N})}{\sum _{j=1}^N K(\frac{\omega - \omega _j}{h_N})} \le 1. \end{aligned}$$
(14)

Therefore, equation (14) implies that the requirement in assumption A0 is satisfied.

Corollary 3.8

(Asymptotic convergence of LEONA-kernel) Let \(K \in {\mathcal {K}}^{kernel}\) and let \(\omega \) be fixed. Let \(v_{N,i}(\omega ) = \frac{K(\frac{\omega - \omega _i}{h_N})}{\sum _{j=1}^N K(\frac{\omega - \omega _j}{h_N})}\), where the bandwidth \(h_N\) satisfies \(h_N= C N^{-\beta }, \ C > 0, \ \beta \in (0,\frac{1}{m_1})\). Let \(\{{\tilde{x}}_q\}_q\) be generated by LEONA-kernel. Suppose assumptions A1–A5 are satisfied. Further suppose that \({\mathbb {E}}\left[ \vert F(x,{\tilde{\xi }}) \vert \max \{\log \vert F(x,{\tilde{\xi }}) \vert ,0 \} \right] < \infty \) for all \(x \in X\). Then the following holds:

$$\begin{aligned} \lim _{q \rightarrow \infty } {\mathbb {E}}[f({\tilde{x}}_q,\omega ) - c^*(\omega )] = 0. \end{aligned}$$

Proof

The proof is similar to that of Corollary 3.7; the only difference is that Theorem 2.2 is used for the pointwise convergence of the kernel estimate. \(\square \)

4 Non-parametric stochastic decomposition

We begin by noting that traditional cutting-plane methods (e.g., Benders) rely on valid supporting hyperplanes of the true objective function. In the presence of streaming data, the cuts computed in any iteration are not only biased at a given point but may in fact cut away portions of the asymptotic objective function, and Benders-type methods may never recover without additional modifications. However, SD-type algorithms ([27, 28, 47]) avoid such issues by performing a local piecewise linear approximation of the objective function followed by the necessary cut updates. For this reason, we propose a non-parametric extension of the SD algorithm to solve a two-stage predictive stochastic linear programming (two-stage predictive SLP) problem stated as follows.

$$\begin{aligned} \begin{aligned} \min _{x} \ f(x, \omega ) =&\ c^\top x + {\mathbb {E}}_{{\tilde{\xi }} \sim K(\omega ,\cdot )}[h(x,{\tilde{\xi }}) \vert {\tilde{\omega }} = \omega ]\\ \text {s.t.} \ {}&x \in X \triangleq \{x \in {\mathbb {R}}^n : A x \le b\}, \end{aligned} \end{aligned}$$
(15)

where \(h(x,\xi )\) is the second-stage recourse function, whose formulation is

$$\begin{aligned} \begin{aligned} h(x,\xi ) = \min \{ d^\top y : D y = e(\xi ) - C(\xi ) x, \ y \ge 0\}. \end{aligned} \end{aligned}$$

Throughout this section, we assume the following conditions hold:

(B1):

X is a compact convex nonempty subset of \({\mathbb {R}}^n\).

(B2):

The second-stage subproblem satisfies the relatively complete recourse property.

(B3):

(Regularity of the probability space) The set of all the possible outcomes of \({\tilde{\xi }}\), \({\mathcal {Y}}\), is compact. \({\mathcal {Y}}\) is a Borel subset of a complete separable metric space and \(\Sigma _{{\mathcal {Y}}}\) is the \(\sigma \)-algebra generated by \({\mathcal {Y}}\). \({\mathcal {X}}\) is a Borel subset of a complete separable metric space and \(\Sigma _{{\mathcal {X}}}\) is the \(\sigma \)-algebra generated by \({\mathcal {X}}\). Let \({\mathcal {S}} = {\mathcal {X}} \times {\mathcal {Y}}\) and let \(\Sigma _{\mathcal {S}}\) be the \(\sigma \)-algebra generated by \({\mathcal {S}}\). \({\mathcal {S}}\) is a Borel subset of a complete separable metric space. Furthermore, the joint distribution function \(\mu _{{\tilde{\omega }}, {\tilde{\xi }}}\) exists and the conditional distribution of \({\tilde{\xi }}\) given \({\tilde{\omega }} = \omega \) is defined by (4).

(B4):

For any \(\xi \in {\mathcal {Y}}\), \({\mathbb {P}}(\Vert {\tilde{\xi }} - \xi \Vert < \delta ) > 0\) for any \(\delta > 0\).

(B5):

\(({\tilde{\omega }}_1, {\tilde{\xi }}_1), ({\tilde{\omega }}_2, {\tilde{\xi }}_2), \ldots \) are i.i.d. random variables.

(B6):

\(h(x,\xi ) \ge 0\) for all \(x \in X\) and almost every \(\xi \in {\mathcal {Y}}\). There exists \(M_h < \infty \), such that \(h(x,\xi ) \le M_h\) for all \(x \in X\) and almost every \(\xi \in {\mathcal {Y}}\).

(B7):

There exists \(M_C \in (0, \infty )\) such that \(\Vert C(\xi )\Vert \le M_C\) for almost every \(\xi \in {\mathcal {Y}}\).

(B8):

\(k_l = \lfloor l^\beta \rfloor \) and \(\beta \in (0,1)\), for \(l \in {\mathbb {N}}\).

Assumptions B1–B2 are standard in the SP literature. Assumptions B3–B5 ensure that the algorithm can sample a scenario that is arbitrarily close to one that has already been generated, which is essential for the correctness of the algorithm. Assumption B6 is necessary for the update of previous minorants in streaming data applications. Assumptions B2 and B7 together imply that the second-stage recourse function is Lipschitz continuous on X. Unlike Machine Learning (ML), where the estimates are not the outcome of constrained optimization problems, two-stage SP models have the potential to magnify the impact of uncertainty due to second-stage constraints. As a result, the boundedness assumption maintains some regularity in the objective values associated with estimated decisions. For the same reason, we also include the relatively complete recourse assumption (assumption B2), which precludes infeasibility of the second-stage problem. Assumption B8 ensures that \(k_l\) satisfies the regularity condition required for the convergence of kNN estimation.

When the sample size equals l, the \(k_l\) nearest neighbor estimate of \(f(x, \omega )\) in (15) is

$$\begin{aligned} {\hat{f}}^{k_l}_l(x, \omega ; S^l) \triangleq c^\top x + \frac{1}{k_l} \sum _{i=1}^l {\mathbb {I}}\left( \omega _i \in {\mathcal {S}}(k_l,\omega ; \{\omega _j \}_{j=1}^l) \right) h(x,\xi _i), \end{aligned}$$

where \(S^l = \{(\omega _i,\xi _i) \}_{i=1}^l\) is a dataset and \({\mathcal {S}}(k_l,\omega ; \{\omega _j \}_{j=1}^l)\) is the set of \(k_l\) nearest neighbors of \(\omega \) from the sample set \(\{\omega _j \}_{j=1}^l\). An alternative representation of \({\hat{f}}^{k_l}_l\) uses order statistics: \( {\hat{f}}^{k_l}_l(x, \omega ; S^l) = c^\top x + \frac{1}{k_l} \sum _{i=1}^{k_l} h(x,\xi ^l_{[i]})\), where \(\omega ^l_{[i]}\) in the tuple \(( \omega ^l_{[i]},\xi ^l_{[i]})\) is the \(i^{th}\) nearest neighbor of \(\omega \) from the set \(\{\omega _j \}_{j=1}^l\).

Algorithm 3: SD-kNN (pseudocode figure)

4.1 Algorithm design

In the SD-kNN algorithm, a minorant refers to an affine function of the first-stage decision, x, that is a lower bound on the kNN estimate of the objective function. One iteration of SD-kNN (Algorithm 3) consists of five components, which we illustrate by walking through the \(l^{\textrm{th}}\) iteration. The first component is Candidate Selection, which selects the next candidate solution by solving a proximal mapping problem given a piecewise linear approximation of the kNN estimate of the objective function. Problem (16) is equivalent to

$$\begin{aligned} x^l = \arg \min _{x \in X} \ {\check{f}}_{l-1}(x, \omega ) + \frac{\sigma _l}{2} \Vert x - {\hat{x}}^{l-1} \Vert ^2, \end{aligned}$$

where \({\check{f}}_{l-1}(x, \omega )\) is a piecewise linear approximation of the kNN estimate of the objective function, \({\hat{f}}^{k_{l-1}}_{l-1}(x, \omega ; S^{l-1})\), \({\hat{x}}^{l-1}\) is the incumbent solution from the last iteration, and \(x^l\) is the current candidate solution. SD-kNN then generates a new data pair, \((\omega _l,\xi _l)\), sets \(k_l = \lfloor l^\beta \rfloor \), and finds the new kNN set, \({\mathcal {S}}(k_l,\omega ; \{\omega _j \}_{j=1}^{l}) = \{ \omega ^l_{[i]} \}_{i=1}^{k_l}\). Next, SD-kNN computes a (possibly new) dual extreme point at the new candidate solution for the new data pair:

$$\begin{aligned} \pi ^l \in \arg \max \{ \pi ^\top (e(\xi _l) - C(\xi _l) x^l) \ \vert \ \pi ^\top D \le d^\top \}. \end{aligned}$$
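
Computing \(\pi ^l\) amounts to solving a small linear program; a minimal sketch using scipy (with the data \(D, d, e(\xi _l), C(\xi _l)\) passed as arrays) might look as follows.

```python
import numpy as np
from scipy.optimize import linprog

def dual_extreme_point(D, d, e_xi, C_xi, x):
    """Solve max{ pi^T (e(xi) - C(xi) x) : pi^T D <= d^T } for a dual vector (sketch)."""
    D = np.asarray(D, dtype=float)
    r = np.asarray(e_xi, dtype=float) - np.asarray(C_xi, dtype=float) @ np.asarray(x, dtype=float)
    m = D.shape[0]                                   # dimension of pi
    res = linprog(c=-r,                              # maximize r^T pi
                  A_ub=D.T, b_ub=np.asarray(d, dtype=float),
                  bounds=[(None, None)] * m,         # pi is a free variable
                  method="highs")
    return res.x
```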

By including the newly computed dual extreme point, the set of explored dual extreme points is updated; minorants that are inactive at the new candidate solution are removed (see equation (17)). The second and third components are Dual Selection and Minorant Construction, which use the set of explored dual extreme points and the kNN set to construct new minorants at the candidate and incumbent solutions. The constructed minorants are denoted \({\mathcal {C}}_l^l(x,\omega )\) and \({\mathcal {C}}_{-l}^{l}(x,\omega )\) (see Algorithm 3 for details). The fourth component is the Minorant Update, which updates the minorants from previous iterations for streaming data applications:

$$\begin{aligned} {\mathcal {C}}^{l}_i(x,\omega ) = {\left\{ \begin{array}{ll} {\mathcal {C}}^{l-1}_i(x,\omega ) - \frac{M_h}{k_{l-1}} &{} \text {if } k_l = k_{l-1}, \\ \frac{k_{l-1}}{k_{l}} {\mathcal {C}}^{l-1}_i(x,\omega ) &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$
(18)

By doing so, it ensures that \(c^\top x + {\mathcal {C}}^{l}_i(x,\omega ) \le {\hat{f}}^{k_{l}}_{l}(x, \omega ; S^{l})\), which will be proven later. SD-kNN then uses the updated minorants and the newly constructed minorants to build a new piecewise linear estimate of the objective function in (15):

$$\begin{aligned} \check{f}_l(x,\omega ) = c^\top x + \max \{ {\mathcal {C}}^l_i (x,\omega ): i \in {\mathcal {J}}_l\}, \end{aligned}$$

where \({\mathcal {J}}_l\) is an index set of the minorants used for function construction (see algorithm 3 for details). The last component is Incumbent Selection, which selects a sequence of incumbents that ensure “relative" descent:

$$\begin{aligned} {\hat{x}}^l = {\left\{ \begin{array}{ll} x^l &{} \text {if } \check{f}_l(x^l,\omega ) - \check{f}_l({\hat{x}}^{l-1}, \omega ) \le q \left[ \check{f}_{l-1}(x^l,\omega ) - \check{f}_{l-1}({\hat{x}}^{l-1}, \omega )\right] , \\ {\hat{x}}^{l-1} &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$
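
The incumbent test can be written compactly; in the sketch below, f_check_l and f_check_prev stand for \(\check{f}_l(\cdot ,\omega )\) and \(\check{f}_{l-1}(\cdot ,\omega )\), and the default value of the fraction q is purely illustrative.

```python
def select_incumbent(x_cand, x_inc, f_check_l, f_check_prev, q=0.5):
    """Incumbent Selection (sketch): accept the candidate x^l only if the new
    approximation certifies at least a q-fraction of the previously predicted descent."""
    new_gap = f_check_l(x_cand) - f_check_l(x_inc)
    old_gap = f_check_prev(x_cand) - f_check_prev(x_inc)
    return x_cand if new_gap <= q * old_gap else x_inc
```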

We summarize all the steps of SD-kNN in algorithm 3.

Since both Candidate Selection and Incumbent Selection are inspired by the regularized SD in [28], we obtain the same descent condition by solving (16).

Lemma 4.1

Given \(\sigma \ge 1\), the following holds:

$$\begin{aligned} \check{f}_l(x^{l+1},\omega ) - \check{f}_l({\hat{x}}^{l},\omega ) \le -\sigma \Vert x^{l+1} - {\hat{x}}^{l} \Vert ^2 \le -\Vert x^{l+1} - {\hat{x}}^{l} \Vert ^2 \le 0. \end{aligned}$$

Recall that in the LEONA algorithm, we generate a dataset of covariates that is independent of the past to calculate each NSQG; hence, previously calculated NSQGs are never reused. In contrast, SD-kNN uses an incremental sampling scheme in the sense that it uses all the generated data to calculate a new NSQG. Moreover, it uses both previously calculated NSQGs and the current NSQG to construct an outer approximation of the objective function. Similar to SD, SD-kNN also keeps a record of captured second-stage duals to accelerate the computation of the NSQGs.

Compared with SD, the main differences in SD-kNN lie in the minorant construction and the minorant updates. Namely, the minorant construction is achieved by the kNN estimation method, and the minorant update in (18) ensures that the updated minorants remain lower bounds of the new kNN estimate of the objective function. In the SD-kNN algorithm, we do not explicitly compute the Lagrange multiplier for every sample. Instead, we only compute the Lagrange multipliers for samples that belong to the k nearest neighbor set of \(\omega \). That is, if \(k_l = k_{l+1}\) and \(\Vert \omega ^l_{[k_l]} - \omega \Vert = \Vert \omega _{l+1} - \omega \Vert \), then \(\omega _{l+1}\) will not be included in the \(k_{l+1}\) nearest neighbor set. The correctness of the new minorant update is validated in the following proposition.
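
As illustrated in the sketch below (with cuts stored as coefficient pairs, a representation assumed here for concreteness), the minorant update (18) is a simple shift or rescaling of the stored coefficients; \(M_h\) is the bound from assumption B6.

```python
import numpy as np

def update_minorants(cuts, k_prev, k_new, M_h):
    """Minorant update (18) applied to stored cut coefficients (sketch).

    Each cut (alpha, beta) represents C(x, omega) = alpha + beta^T x.
    """
    if k_new == k_prev:
        # k unchanged: shift the intercept down by M_h / k_prev
        return [(alpha - M_h / k_prev, beta) for alpha, beta in cuts]
    # k increased: rescale the whole affine function by k_prev / k_new
    return [(alpha * k_prev / k_new, np.asarray(beta) * k_prev / k_new)
            for alpha, beta in cuts]
```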

Proposition 4.2

(Lower Bound Properties in the Minorant Update) Suppose that assumptions B1–B8 hold. Let \({\mathcal {J}}_l\) denote the set of minorants used in the \(l^{\textrm{th}}\) iteration and let \({\mathcal {C}}^l_i(x,\omega ) = \alpha ^l_i + (\beta ^l_i)^\top x \), \(i \in {\mathcal {J}}_l\), be the minorant that was first generated at iteration i and updated through iteration l. Suppose that \({\hat{f}}^{k_l}_l(x,\omega ; S^l) \ge c^\top x + {\mathcal {C}}^l_i(x,\omega )\) for all \(i \in {\mathcal {J}}_l\). Then the following hold:

a. If \(k_{l+1} = k_l\), then \({\hat{f}}^{k_{l+1}}_{l+1}(x,\omega ;S^l) \ge c^\top x + C^l_i(x,\omega ) - \frac{M_h}{k_l}\), \(i \in {\mathcal {J}}_l\).

b. If \(k_{l+1} > k_l\), then \({\hat{f}}^{k_{l+1}}_{l+1}(x,\omega ;S^{l+1}) \ge c^\top x + \frac{k_{l}}{k_{l+1}} C^l_i(x,\omega )\), \(i \in {\mathcal {J}}_l\).

Proof

See Appendix A.4. \(\square \)

Remark 1

Note that in the beginning, \(\frac{M_h}{k_{l}}\) may be so large that the updated cuts lie below zero over a large portion of \(x \in X\), making them loose. But as \(l\) increases, \(k_l\) becomes large enough that the effect of \(\frac{M_h}{k_{l}}\) is negligible.

Remark 2

There is no loss of generality in assuming that h is bounded below by 0. This follows from the fact that, in the more general case where h is bounded below by a constant \(m_h\), the update rule can be modified to the following form (a sketch of both cases is given after the list):

  (a) If \(k_{l+1} = k_l\), then \({\hat{f}}^{k_{l+1}}_{l+1}(x,\omega ;S^l) \ge c^\top x + {\mathcal {C}}^l_i(x,\omega ) - \frac{M_h - m_h}{k_l}\), \(i = 1,2,\ldots ,l\).

  (b) If \(k_{l+1} > k_l\), then \({\hat{f}}^{k_{l+1}}_{l+1}(x,\omega ;S^{l+1}) \ge c^\top x + \frac{k_{l}}{k_{l+1}} {\mathcal {C}}^l_i(x,\omega ) + \frac{k_{l+1} - k_l}{k_{l+1}} m_h\), \(i = 1,2,\ldots ,l\).
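
The following Python sketch applies these two update cases to the coefficients of a stored minorant; it is a minimal illustration, with `M_h` and `m_h` standing for the upper and lower bounds on h used above, and setting `m_h = 0` recovers the update in Proposition 4.2.

```python
import numpy as np

def update_minorant(alpha, beta, k_prev, k_new, M_h, m_h=0.0):
    """Update one stored minorant C(x) = alpha + beta @ x when the neighborhood
    size changes from k_prev to k_new (illustrative sketch of the two cases).

    alpha: intercept (float); beta: slope (numpy array or float).
    """
    if k_new == k_prev:
        # Case (a): keep the slope, shift the intercept down by (M_h - m_h)/k_prev.
        return alpha - (M_h - m_h) / k_prev, beta
    # Case (b): k_new > k_prev; rescale the cut and add the lower-bound correction.
    scale = k_prev / k_new
    return scale * alpha + (k_new - k_prev) / k_new * m_h, scale * np.asarray(beta)
```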

By the minorant construction and updates in the SD-kNN algorithm, it is easy to verify that \(\check{f}_l(\cdot ,\omega )\) is an outer approximation of \({\hat{f}}^{k_l}_l(\cdot ,\omega )\), which is summarized in the following lemma.

Lemma 4.3

Suppose that assumptions B1–B8 are satisfied. Then the following hold:

  (a) \( h_l(x,\xi ) \le h(x,\xi )\), for all \(x \in X\), \(l \in {\mathbb {N}}\).

  (b) \(c^\top x + {\mathcal {C}}^l_l(x,\omega ) \le f^{k_l}_l(x,\omega )\) and \(c^\top x + {\mathcal {C}}^l_{-l}(x,\omega ) \le f^{k_l}_{-l}(x,\omega )\), for all \(x \in X\), \(l \in {\mathbb {N}}\).

  (c) \(\check{f}_l(x,\omega ) \le {\hat{f}}^{k_l}_l(x,\omega )\), for all \(x \in X\), \(l \in {\mathbb {N}}\).

4.2 Convergence results

The theoretical analysis of the SD-kNN algorithm consists of three steps. First, we show the asymptotic convergence of the generated minorants at any accumulation point. Next, we show the subsequence convergence of the candidates and incumbents, together with the directional derivatives of the piecewise linear approximations of the objective function generated by SD-kNN. Third, we show that the sequence of incumbent solutions generated by SD-kNN converges to an optimal solution of (15) with probability one. For the remainder of this section, we let \(\{x^l\}_{l=1}^\infty \) and \(\{{\hat{x}}^l\}_{l=1}^\infty \) denote the sequences of master-problem and incumbent solutions, respectively, identified by SD-kNN.

Theorem 4.4

(Asymptotic Convergence of the Minorants) Suppose assumptions B1–B8 hold. Let \(\{(\alpha ^{l}_{l},\beta ^l_l)\}_l\) and \(\{(\alpha ^{l}_{-l},\beta ^l_{-l})\}_l\) be the sequences of coefficients calculated in the minorant construction of SD-kNN (Algorithm 3). With probability one, the following hold:

  (a) For any accumulation point, \({\hat{x}}\), of \(\{x^l\}_{l=1}^\infty \), one has

    $$\begin{aligned} \lim _{n \rightarrow \infty } \alpha ^{l_n}_{l_n} + (\beta ^{l_n}_{l_n})^\top x^{l_n} = {\mathbb {E}}_{{\tilde{\xi }} \sim K(\omega ,\cdot )}[h({\hat{x}},{\tilde{\xi }}) \vert {\tilde{\omega }} = \omega ], \end{aligned}$$

    where \(\{x^{l_n} \}_{n=1}^{\infty }\) is an infinite subsequence of \(\{x^l\}_{l=1}^\infty \) and \(\lim \limits _{n \rightarrow \infty } x^{l_n} = {\hat{x}}\).

  (b) For any accumulation point, \({\hat{x}}'\), of \(\{{\hat{x}}^l\}_{l=1}^\infty \), one has

    $$\begin{aligned} \lim _{n' \rightarrow \infty } \alpha ^{l_{n'}}_{-l_{n'}} + (\beta ^{l_{n'}}_{-l_{n'}})^\top {\hat{x}}^{l_{n'}} = {\mathbb {E}}_{{\tilde{\xi }}\sim K(\omega ,\cdot )}[h({\hat{x}}',{\tilde{\xi }}) \vert {\tilde{\omega }} = \omega ], \end{aligned}$$

    where \(\{{\hat{x}}^{l_{n'}} \}_{n'=1}^{\infty }\) is an infinite subsequence of \(\{{\hat{x}}^l\}_{l=1}^\infty \) and \(\lim \limits _{n' \rightarrow \infty } {\hat{x}}^{l_{n'}} = {\hat{x}}'\).

Proof

See Appendix A.5. \(\square \)

One direct consequence of Theorem 4.4 is that, with probability one, the objective function estimates generated by the SD-kNN algorithm converge to the true objective function value at any accumulation point of the sequence of incumbent solutions. This result is summarized in the following corollary.

Corollary 4.5

Suppose assumptions B1–B8 hold. With probability one, for any accumulation point, \({\hat{x}}\), of \(\{{\hat{x}}^l \}\), the following holds:

$$\begin{aligned} \lim _{n \rightarrow \infty } \check{f}_{l_n}({\hat{x}}^{l_n},\omega ) = \lim \limits _{n \rightarrow \infty } \check{f}_{l_n+1}({\hat{x}}^{l_n},\omega ) = f({\hat{x}},\omega ), \end{aligned}$$

where \(\{{\hat{x}}^{l_n}\}_{n=1}^\infty \) is the subsequence of \(\{{\hat{x}}^l\}_{l=1}^\infty \) such that \(\lim \limits _{n \rightarrow \infty } {\hat{x}}^{l_n} = {\hat{x}}\).

Proof

See Appendix A.6. \(\square \)

The lemma below establishes an important relationship between the directional derivative of the true objective function and the directional derivative of its piecewise linear approximation generated by SD-kNN.

Lemma 4.6

Suppose that assumptions B1–B8 hold and \(\sigma \ge 1\). For a given \(x \in X\), let \(d^l(x) = \frac{ x - x^{l+1}}{\Vert x - x^{l+1}\Vert }\). Then, with probability one, the following hold:

  (a) There exists an infinite subsequence of iterations \({\mathcal {L}} \subset {\mathbb {N}}\) such that \(\lim \limits _{l \in {\mathcal {L}}} x^{l+1} = \lim \limits _{l \in {\mathcal {L}}} {\hat{x}}^l = {\hat{x}}\), for some \({\hat{x}} \in X\).

  (b) \(\limsup \limits _{l \in {\mathcal {L}}} \check{f}'_l(x^{l+1},\omega ;d^l(x)) \le f'({\hat{x}},\omega ; \frac{x - {\hat{x}}}{\Vert x - {\hat{x}} \Vert })\).

Proof

See Appendix A.7. \(\square \)

With Lemma 4.6, we can prove the asymptotic convergence of the SD-kNN algorithm in the theorem below.

Theorem 4.7

(Asymptotic Convergence of SD-kNN) Suppose that assumptions B1–B8 hold and \(\sigma \ge 1\). Then, with probability one, the sequence \(\{{\hat{x}}^l\}_{l=1}^\infty \) converges to an optimal solution of problem (15).

Proof

We first show that SD-kNN produces an infinite subsequence of iterations \({\mathcal {L}} \subset {\mathbb {N}}\) such that \(\{x^{l+1}\}_{l \in {\mathcal {L}}}\) and \(\{{\hat{x}}^l \}_{l \in {\mathcal {L}}}\) converge to the same optimal solution of (15). This result follows from Lemma 4.6 and [28, Theorem 5].

Finally, the almost sure convergence of \(\{{\hat{x}}^l\}_{l=1}^\infty \) follows from [47, Theorem 1]. \(\square \)

Remark

Based on the extension of SD to two-stage stochastic quadratic-quadratic programming (SQQP) problems in [36], it is also quite straightforward to extend the SD-kNN algorithm to solve two-stage predictive SQQP problems.

5 Numerical experiments

Both the LEONA and SD-kNN algorithms are implemented in C++. CPLEX v12.8.0 is used as the base solver to solve the convex quadratic programming and linear programming subproblems and to obtain the associated dual multipliers. All programs are run on a MacBook Pro (2017) with a 3.1 GHz dual-core Intel Core i5.

In SD-kNN, we design a dynamic step-size rule based on [48]. Namely, if the incumbent selection test is passed (i.e., \(\check{f}_l(x^l,\omega ) - \check{f}_l({\hat{x}}^{l-1},\omega ) \le q \left[ \check{f}_{l-1}(x^l,\omega ) - \check{f}_{l-1}({\hat{x}}^{l-1},\omega )\right] \)), then we set \(\sigma _l = \max \{ \frac{1}{2}\sigma _l, {\underline{\sigma }}\}\); otherwise, we set \(\sigma _l = \min \{ 2 \sigma _l, {\overline{\sigma }}\}\), where \(1 \le {\underline{\sigma }}< {\overline{\sigma }} < \infty \). In our implementation, we let \(q = 0.5\), \({\underline{\sigma }} = 1\), and \({\overline{\sigma }} = 100\). To determine the initial incumbent solution of SD-kNN, we design a presolve process. In particular, we use a certain number of data points to compute the kNN estimate of the response, \({\bar{\xi }}\vert _{{\tilde{\omega }} = \omega }\). Let \({\bar{N}}\) denote the sample size for the presolve process and let \({\bar{k}} = \lfloor {\bar{N}}^\beta \rfloor \) denote the corresponding number of nearest neighbors; the estimate is \({\bar{\xi }}\vert _{{\tilde{\omega }} = \omega } = \frac{1}{{\bar{k}}} \sum _{i=1}^{{\bar{N}}} {\mathbb {I}}\left( \omega _i \in {\mathcal {S}}({\bar{k}},\omega ; \{\omega _j \}_{j=1}^{{\bar{N}}}) \right) \xi _i. \) The initial incumbent solution, \(x^0\), is then obtained by solving a deterministic two-stage problem with \({\bar{\xi }}\vert _{{\tilde{\omega }} = \omega }\) as the realization of the second-stage scenario (i.e., \(x^0 \in \arg \min \{c^\top x + h(x,{\bar{\xi }}\vert _{{\tilde{\omega }} = \omega }): x \in X\}\)).
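
A compact sketch of the step-size update and the presolve estimate follows; the function names are illustrative, and the deterministic solve that produces \(x^0\) is not shown.

```python
import numpy as np

def update_sigma(sigma, descent_test_passed, sigma_min=1.0, sigma_max=100.0):
    """Dynamic proximal parameter: halve sigma (down to sigma_min) when the
    incumbent selection test passes, and double it (up to sigma_max) otherwise."""
    if descent_test_passed:
        return max(0.5 * sigma, sigma_min)
    return min(2.0 * sigma, sigma_max)

def presolve_response(omega_obs, omegas, xis, beta=0.6):
    """kNN presolve estimate of the response at the observed covariate: average
    the responses whose covariates are among the k-bar nearest neighbors of
    omega_obs, with k-bar = floor(N-bar ** beta).

    omegas: (N, m1) array of covariate observations; xis: matching responses.
    """
    omegas, xis = np.asarray(omegas), np.asarray(xis)
    k = max(1, int(np.floor(len(omegas) ** beta)))
    dists = np.linalg.norm(omegas - omega_obs, axis=1)
    nearest = np.argsort(dists)[:k]          # indices of the k-bar nearest neighbors
    return xis[nearest].mean(axis=0)         # the estimate bar-xi given omega
```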

To validate the performance of the SD-kNN algorithm, we implemented a multi-cut Benders decomposition (see [4, Chapter 6.5]) that sequentially re-solves the PSP instance each time a new sample is observed, which we refer to as online multi-cut Benders decomposition (OMBD). Based on OMBD, we implemented a variation that records all previously calculated cuts, which we refer to as online Benders decomposition (OBD). The pseudocode of both benchmark algorithms is presented in Appendix B.

5.1 Two-stage predictive SLP

Here, we demonstrate the computational performance of LEONA and SD-kNN by solving two instances: a) a two-stage shipment planning problem, which first appeared in [3], and b) a two-stage multiproduct inventory problem, which is a modification of the single-period multiproduct inventory model proposed in [2]. Throughout this section, we refer to the two-stage shipment planning problem and the two-stage multiproduct inventory problem as BK19 and P-BAA99, respectively. The problem dimensions are summarized in Table 1. In particular, BK19 has 4 first-stage decision variables and 4 first-stage inequality constraints, whereas one realization of its second-stage subproblem has 52 inequality constraints (excluding non-negativity constraints) and 16 decision variables. Model P-BAA99 has 25 decision variables and 50 inequality constraints in the first stage, and one realization of its second-stage subproblem has 1,375 equality constraints and 100 decision variables. We replicate each algorithm 30 times (i.e., with 30 different random seeds) for each instance.

Table 1 Sizes of the two-stage Predictive SLP instances

For simplicity, we let LEONA-E, LEONA-Q, LEONA-G, LEONA-N, and LEONA-k denote LEONA with the Epanechnikov kernel, quartic kernel, Gaussian kernel, Naïve kernel, and kNN method, respectively. The sample size is chosen to be consistent with the number of NSQG calls in LEONA. To set up the experiment in the presence of streaming data, we let the size of the initial set of observations be 50, and one new sample is observed at a time. In LEONA-k and SD-kNN, \(\beta \) is set to 0.6. In LEONA with kernel estimation, \(\beta \) is set to \(0.5/m_1\). For BK19, C is set to 1 for LEONA with the Gaussian kernel as well as for LEONA with the other kernels in \({\mathcal {K}}^{kernel}\). For P-BAA99, C is set to 10 for LEONA with the Gaussian kernel, while C is set to 50 for LEONA with the other kernels in \({\mathcal {K}}^{kernel}\). The scale of the averaging window (m) in LEONA is set to 1. As a result, the total number of samples used by LEONA is 8,645 after 13 outer iterations.

5.1.1 Two-stage shipment planning problem

In BK19, the observed predictor is set to \(\omega = (-1.1401, 0.3406, 1.3871)\) and the estimated true cost is calculated using 100,000 samples from the true conditional distribution (see [3] for the data generation process). Accordingly, the estimated cost of a solution \({\hat{x}}\) is \(\theta _v ({\hat{x}};\omega ) = \frac{1}{{\mathcal {N}}_v} \sum _{i=1}^{{\mathcal {N}}_v} F({\hat{x}},\xi _i^\omega )\), and \(c^*(\omega ) = \min \limits _{x\in X} \frac{1}{{\mathcal {N}}_v}\sum _{i=1}^{{\mathcal {N}}_v} F(x,\xi _i^\omega )\) is the estimated “optimal” cost of the perfect-foresight problem. In particular, \(c^*(\omega ) = 772.152\).
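
As a reference for how such validated quantities can be computed, here is a minimal sketch with hypothetical names: `F` evaluates \(F(x,\xi )\) and `xi_val` holds the \({\mathcal {N}}_v\) validation scenarios drawn from the true conditional distribution.

```python
import numpy as np

def validated_cost(x_hat, F, xi_val):
    """theta_v(x_hat; omega): average cost of the estimated decision over the
    validation scenarios drawn from the true conditional distribution."""
    return float(np.mean([F(x_hat, xi) for xi in xi_val]))

def relative_gap(x_hat, F, xi_val, c_star):
    """One way to express solution quality: validated cost relative to the
    estimated optimal cost c_star(omega) of the perfect-foresight problem."""
    return (validated_cost(x_hat, F, xi_val) - c_star) / abs(c_star)
```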

Table 2 BK19 (sample size \(=\) 8,645)

Fig. 1 Computational results for BK19. a Validated solution quality of various algorithms versus sample size. b Impact of the dimension of the predictor for BK19 (sample size \(=\) 8,645)

Table 2 and Figs. 1 and 2 summarize the computational performance of LEONA and SD-kNN on the BK19 instance. In Table 2, we summarize the average performance of each algorithm when the total number of observed samples is 8,645. Overall, SD-kNN outperforms LEONA in terms of solution quality and stability while requiring more computational time. The higher time consumption of SD-kNN is due to the fact that it updates the solution for each new observation of the covariate, while LEONA updates the solution only for a certain batch of samples. On the other hand, the benchmark algorithms OMBD/OBD attain only a marginally better solution (by a fraction of a percent) than SD-kNN, at the cost of much greater computational time (i.e., 3315.64s/159.70s vs. 82.63s). For streaming data, implementing SD requires only one iteration between data updates to ensure asymptotic convergence, so the time per iteration is the relevant quantity for judging whether SD-kNN provides a realistic approach for streaming data applications. We observe that the time per iteration of SD-kNN is 0.01s, which seems reasonable for such applications. Accordingly, it is appropriate to study trade-offs between SD-kNN and the two first-order (SQG) methods for streaming data applications. Figure 1a shows that SD-kNN converges faster than LEONA in terms of the number of samples used. The smaller sample size observed for SD-kNN can likely be attributed to the piecewise linear approximations of the objective function in SD compared with the linear (first-order) approximations of LEONA.

Fig. 2 Computational results of SD-kNN on BK19. a Total number of explored dual extreme points versus number of solution updates in SD-kNN. b Number of minorants versus number of solution updates in SD-kNN. In both figures, the solid line is the average of the corresponding quantity and the shaded area shows the range between the largest and smallest values over the 30 replications across the entire computational process

In Fig. 2a, we observe that the number of unique dual vertices stored by SD-kNN stabilizes at around 105 as the number of solution updates increases. In Fig. 2b, we observe that the number of stored minorants stays between 2 and 18 across the entire computational process over all 30 replications. The average number of minorants decreases because “inactive minorants” (at a new candidate solution) can be removed (see equation (16)), as recommended by SD theory [28]. This keeps the number of stored minorants from growing beyond a finite upper bound [28].

We provide experimental evidence on the impact of the dimension of the predictor on the performance of the algorithms for BK19. In BK19, the response is a linear transformation of the predictor with some white noise. We increase the dimension of the predictor by including auxiliary random variables, which are characterized as AR(1) time series; the response still follows the same pattern as in the original BK19. For each algorithm, we evaluate the out-of-sample objective cost of the estimated decisions when the sample size equals a point from the set {50, 153, 315, 545, 855, 1260, 1778, 2430, 3240, 4235, 5445, 6903, 8645, 10710, 13140, 15980, 19278, 23085, 27455, 32445} (i.e., the set of points is based on the sample sizes used for LEONA’s outer iteration numbers ranging from 1 to 20). We evaluate sample efficiency based on the following two metrics: a) the required sample size for the average optimality gap to drop to within 5% of the estimated optimal cost; and b) computational time. These metrics are summarized in Table 3.

The first line of the table reports the performance of the benchmark, OMBD. The computational time reported in the first row demonstrates that an online version of Benders decomposition is so time-consuming as to be unacceptable for streaming data. When \(m_1 = 15\), OMBD takes more than 55 min to reach the required precision (within 5% of the optimal cost), while LEONA-E takes less than 7 s. Compared with LEONA, SD-kNN has better sample efficiency but higher computational time. Compared with OMBD, SD-kNN has similar sample efficiency but much lower computational time. Among the three algorithms compared in the study, LEONA takes the least time while requiring larger sample sizes for comparable accuracy. We observe that both LEONA-N and LEONA-k fail to provide decisions with the required precision when \(m_1 \ge 9\) and the maximum sample size is 32,445. In other words, the dimension of the predictor has a greater impact on the performance of LEONA-N and LEONA-k. Overall, we observe that LEONA-E (Epanechnikov kernel) provides the most effective combination of estimation and optimization among the various choices of kernels in our study. Moreover, the required sample size for each algorithm generally increases as the dimension of the predictor increases. For a fixed number of data points, the solutions for higher-dimensional problems will generally be of lower quality, as shown in Fig. 1b.

Table 3 Sample efficiency of algorithms in BK19

5.1.2 Two-stage multiproduct inventory problem

Table 4 P-BAA99 (sample size \(=\) 5,445)

Fig. 3 Computational results for P-BAA99. a Validated solution quality of various algorithms versus sample size. b Box plot of validated solution qualities (\(N=5445\))

In P-BAA99, the predictor and the demand (i.e., the response) jointly follow a truncated multivariate normal distribution with mean vector \(\mu \) and covariance matrix \(\Sigma \) (the simulation is based on [7]). The observed predictor is \(\omega _i = 108 - \frac{i}{3}\), \(i = 1,2,\ldots ,25\). We use kNN to estimate the true cost of a given solution. Although such an approach will inevitably incur some bias in measuring solution quality, it is available in practice and its performance is ensured by the pointwise convergence of the kNN estimator [51]. To initiate the validation for P-BAA99, we generate \(10^6\) predictor-response data pairs from the unconditional distribution and find the \(10^3\) nearest neighbors of the observed predictor. The associated estimated optimal cost (i.e., \(c^*(\omega )\)) is -14,262. The computational performance of LEONA and SD-kNN is summarized in Table 4 and Fig. 3. Figure 3a plots the average solution quality versus the sample size used. Similar to the results for BK19, SD-kNN converges the fastest in terms of the samples used. We also create box plots for each algorithm when the total sample size is 5,445 in Fig. 3b to visually showcase the distribution of the solution qualities. Based on Table 4 and Fig. 3b, we observe that LEONA-G has much more variation than the other algorithms. On the other hand, LEONA-Q attains the best average solution quality and LEONA-E attains the lowest variation when the sample size is 5,445. We also find that OMBD consumes much more time than SD-kNN (11,461.08s vs. 707.86s) to perform the solution updates without any perceptible improvement in solution quality. By recording all previously calculated cuts, OBD drastically decreases the computational time from 11,461.08s to 895.77s, but even that is considerably higher than the time required by SD-kNN (707.86s), again without improving solution quality. For real-time applications, the LEONA family of algorithms may be the most promising, because the time lag (of a few seconds) is likely easier to tolerate for streaming data applications. This encouraging observation is the first of its kind to be reported in the literature.

6 Conclusions

Implementing a streaming data algorithm requires more care in data structure design because of the dynamic nature of the information revealed during the algorithm. Depending on its calculations, each algorithm has its own way of managing streaming data and its impact on the algorithmic computations. We summarize the key features of LEONA and SD-kNN in terms of standard streaming algorithms from the literature ([19, 21]). In LEONA, we use sliding windows of growing sizes (landmark windows in [19]) to reduce the workload of solution updates and use averaging to summarize the solution updates. In SD-kNN, we utilize several compact data structures to speed up data summaries for further solution updates (synopses in [19]). In particular, we store the \(L_0\) and \(L_1\) norms of all the unique dual extreme points to accelerate the detection of new dual extreme points, and a sorted list of covariates helps accelerate the kNN estimation. To reduce redundant matrix calculations, we implement a data structure that records all the necessary matrix manipulations. Although the above expedients may seem trivial, their implementation is crucial to a successful large-scale implementation.
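
Under the stated assumptions, the two synopsis structures can be sketched as follows (illustrative names only; the bookkeeping in our C++ implementation differs in detail): a fingerprint of \((L_0, L_1)\) norms to recognize new dual extreme points, and a list of covariates kept sorted by distance to the observed predictor so that the k nearest neighbors are always a prefix.

```python
import bisect
import numpy as np

class DualPool:
    """Store unique dual extreme points; the (L0, L1)-norm fingerprint makes a
    full component-wise comparison necessary only for candidates whose
    fingerprint matches an already stored vertex."""
    def __init__(self, tol=1e-8):
        self.tol = tol
        self.buckets = {}  # (L0 count, rounded L1 norm) -> stored dual vertices

    def add(self, pi):
        pi = np.asarray(pi, dtype=float)
        key = (int(np.count_nonzero(np.abs(pi) > self.tol)),
               round(float(np.sum(np.abs(pi))), 6))
        bucket = self.buckets.setdefault(key, [])
        if any(np.allclose(pi, stored, atol=self.tol) for stored in bucket):
            return False   # duplicate: no new dual extreme point
        bucket.append(pi)
        return True        # a new dual extreme point was recorded

def insert_covariate(sorted_dists, sorted_idx, omega_new, omega_obs, new_index):
    """Keep covariate indices sorted by distance to the fixed observed predictor
    omega_obs; the k nearest neighbors are then simply sorted_idx[:k]."""
    d = float(np.linalg.norm(np.asarray(omega_new) - np.asarray(omega_obs)))
    pos = bisect.bisect_left(sorted_dists, d)
    sorted_dists.insert(pos, d)
    sorted_idx.insert(pos, new_index)
    return pos
```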

To conclude, we present an assessment of the contributions mentioned in the Introduction. a) We design the LEONA algorithm, which sequentially uses non-parametric stochastic quasi-gradients to update the estimated solutions to (2). We also design a fusion of non-parametric estimation and the SD algorithm to further improve the efficiency of the solution estimates. b) The convergence result is predicated on diminishing both the statistical estimation bias and the systematic optimality gap. c) Our results apply to kNN and a variety of kernels (Naïve, quartic, Epanechnikov, and Gaussian) under one common umbrella. d) The numerical performance of the LEONA and SD-kNN algorithms was studied using two-stage predictive stochastic linear programming problems. We observe that from the viewpoint of solution quality, SD-kNN appears to be the best choice for these datasets, whereas LEONA-E and LEONA-Q appear to have the lowest latency (t(s)).

We believe there is great potential for the practical use of PSP algorithms in streaming data applications, where the distribution adapts automatically to incoming data thanks to the non-parametric estimation methods. One possible theoretical extension is to derive the finite-sample complexity as well as the “regret” of both the LEONA and SD-kNN algorithms based on [5]. On the application side, we may apply our methodology to the economic dispatch problem [20], in which the prediction of wind power (using non-parametric estimation) can be included in the design of a policy to dispatch conventional generation units in real time. All in all, we expect this combination of estimation and optimization to be a fundamental step towards distribution-free algorithms in SP with covariates. Is this the start of a new era of “estimization” (estimation + optimization)?