Abstract
Distributed stochastic zeroth-order optimization (DSZO), in which the objective function is distributed over multiple agents and the derivatives of local cost functions are unavailable, arises frequently in large-scale machine learning and reinforcement learning. This paper introduces a distributed stochastic algorithm for DSZO that is both projection-free and gradient-free, built on the Frank-Wolfe framework and a stochastic zeroth-order oracle (SZO). Such a scheme is particularly useful for large-scale constrained optimization problems in which computing gradients or projection operators is impractical or costly, or in which the objective function is not differentiable everywhere. Specifically, the proposed algorithm, enhanced by recursive momentum and gradient tracking techniques, guarantees convergence with just a single batch per iteration, a significant improvement over existing algorithms that substantially lowers the computational burden. Under mild conditions, we prove that the complexity bounds on SZO of the proposed algorithm are \(\mathcal{O}(n/\epsilon ^{2})\) and \(\mathcal{O}(n\,2^{\frac{1}{\epsilon}})\) for convex and nonconvex cases, respectively. The efficacy of the algorithm is verified on black-box binary classification problems against several competing alternatives.
1 Introduction
In recent years, distributed optimization has received a surge of interest in diverse areas, including autonomous vehicle control [16], multi-agent systems [31] and sensor networks [1], owing to its significant advantages in data privacy, robustness, flexibility, and scalability. Distributed optimization minimizes a joint objective through local computation and communication between agents in a network. Recently, much effort has been devoted to the distributed stochastic setting [11, 19, 29, 30], where each agent’s objective function is the expectation of a function of random variables that follow unknown distributions. Such situations arise widely in machine learning [5, 19], multi-agent reinforcement learning [25, 27, 28], and unmanned systems [7, 31], to name a few. Most distributed algorithms for solving such problems require explicit gradients of the objective functions. In many practical applications, however, the feedback available to agents is incomplete or noisy because of environmental uncertainty. Hence, the assumption of exact gradient feedback is often too restrictive in practice.
Zeroth-order optimization is a typical gradient-free method that has attracted widespread attention due to its usage in many practical large-scale optimization tasks. In these tasks, the explicit gradient of the objective function is expensive or impossible to obtain, and only function evaluations are accessible. For instance, the objective functions of many big-data problems with complex data generation processes cannot be defined explicitly. Such situations include large-scale black-box adversarial attacks on deep networks [8], simulation-based modeling [20], and reinforcement learning [24], among others. Motivated by these applications, the design and analysis of zeroth-order algorithms have become increasingly popular, including distributed zeroth-order algorithms [21, 32, 34, 35] and stochastic zeroth-order algorithms [33, 36]. Nevertheless, most zeroth-order algorithms, even in centralized settings, are designed for unconstrained optimization problems or depend on projection operators for constraint sets. The projection operations may impose an undesirable computational burden and even become computationally prohibitive for some constraint sets such as latent group Lassos [15], e.g., \(l_{1}\) norm balls and nuclear norm balls.
Consequently, the Frank-Wolfe (FW) method [10], also known as the conditional gradient method, has seen a resurgence because of its projection-free and computationally efficient nature. The FW method avoids the projection step by accessing a linear minimization (LM) oracle, which can be implemented efficiently, especially for some widespread structured constraints (see Table I in [15]). For instance, solving an LM problem over a nuclear norm ball only requires computing a single pair of singular vectors corresponding to the largest singular value, whereas projecting a point onto a nuclear norm ball demands a full singular value decomposition (SVD). Recent years have witnessed extensive research on FW algorithms in both the centralized stochastic setting [2, 12, 18] and the distributed deterministic setting [5, 6, 17]. Note that the aforementioned FW algorithms are all designed around first-order gradients and thus cannot be applied directly to problems where only the value of the objective function is accessible.
The FW method with a stochastic zeroth-order oracle (SZO) has recently been investigated in both convex and nonconvex settings. Specifically, [4] put forth zeroth-order stochastic FW algorithms with complexity bounds (see footnote 1) of \(\mathcal{O}(n/\epsilon ^{2})\) and \(\mathcal{O}(n/\epsilon ^{4})\) on SZO for convex and nonconvex cases, respectively. However, the algorithms in [4] require a mini-batch size that grows with the total number of iterations and the dimension of the problem to guarantee convergence. Further, [14] relaxed the conditions on batch sizes via the variance reduction technique SPIDER, and demonstrated that the resulting algorithm achieves a lower complexity bound of \(\mathcal{O}(n/\epsilon ^{3})\) on SZO in the nonconvex setting. For the convex case, [3] put forth a stochastic zeroth-order FW method that requires only a single batch per iteration by using a momentum-based gradient tracking technique, and obtained a complexity bound of \(\mathcal{O}(n/\epsilon ^{2})\) on SZO. Subsequently, [22] extended centralized stochastic zeroth-order FW methods to a decentralized setting that still depends on a central coordinator, and showed that the proposed algorithm has complexity bounds of \(\mathcal{O}(n/\epsilon ^{3})\) and \(\mathcal{O}(n^{\frac{4}{3}}/\epsilon ^{4})\) on SZO for convex and nonconvex cases, respectively. Unfortunately, no efficient zeroth-order FW methods exist for solving distributed stochastic optimization (DSO) problems in either convex or nonconvex settings.
Motivated by the above discussions, this paper is dedicated to designing a novel distributed projection-free and gradient-free algorithm for DSO problems. We provide a rigorous theoretical analysis of the convergence rate and complexity guarantees of the proposed algorithm, which enjoys a convergence rate comparable to centralized stochastic first-order optimization algorithms [13], filling the theoretical gap of zeroth-order FW methods for DSO problems. Table 1 compares the proposed algorithm with existing alternatives. The main contributions of our work are as follows.
-
We put forth a Distributed Stochastic Zeroth-Order Frank-Wolfe algorithm (DSZO-FW) that combines the gradient tracking technique, the momentum-based variance reduction technique, and coordinate-wise gradient estimation. To the best of our knowledge, DSZO-FW is the first zeroth-order FW algorithm for DSO problems.
-
We derive sufficient conditions for the convergence of DSZO-FW under mild assumptions. Specifically, DSZO-FW converges using only one batch per iteration thanks to the recursive momentum technique [9]. We establish convergence rates of \(\mathcal{O}(k^{-\frac{1}{2}})\) and \(\mathcal{O}(1/\log _{2}(k))\) for the convex and nonconvex cases, respectively. The guarantee in the convex case matches the previous best-known result of centralized stochastic optimization methods.
-
For convex objective functions, we prove that DSZO-FW has a function query complexity of \(\mathcal{O}(n/\epsilon ^{2})\) for finding an ϵ-optimal solution, which coincides with the best existing centralized results [3, 4] and is even smaller than that of the recent decentralized FW method in [22].
-
For nonconvex objective functions, we show that DSZO-FW has a function query complexity of \(\mathcal{O}(n\,2^{\frac{1}{\epsilon}})\) for finding an ϵ-stationary point under time-decaying step sizes. In contrast, other works [4, 14, 22] for solving such problems rely on step sizes that depend on the total number of iterations.
The remainder of the paper is structured as follows. We introduce the problem and the algorithm design in Sect. 2. The convergence analysis and theoretical guarantees of the proposed algorithm are presented in Sect. 3. Section 4 reports several simulation experiments that validate the efficacy of the algorithm. Section 5 concludes the work. The Appendix provides the technical proofs of the paper.
Notations
The notations used in this paper are fairly standard. Specifically, we denote by \(\mathbb{R}\) the set of real numbers and by \(\mathbb{R}_{+}\) the set of nonnegative real numbers. Symbols \(\langle \cdot \rangle \) and \(\lceil \cdot \rceil \) denote the inner product and the ceiling operation, respectively. In addition, \(\mathbb{R}^{p}\) is the set of p-dimensional real vectors. Consider a vector \(v\in \mathbb{R}^{p}\). We write \(\|v\|_{q}\) for the \(l_{q}\) norm of v and \(\|v\|\) for the Euclidean norm of v. We write \(\mathbb{E}[\cdot ]\) to denote the expectation operator; moreover, \(\mathbb{E}[\cdot |\mathcal{F}_{k}]\) represents the conditional expectation with respect to the σ-field \(\mathcal{F}_{k}\). Finally, \(W=[w_{ij}]_{N\times N}\) is the weighted adjacency matrix of a topology graph \(\mathcal{G}(\mathcal{N},\mathcal{E})\), where \(\mathcal{N}=\{1,2,\ldots ,N\}\) is a set of N agents, and \(\mathcal{E}\subseteq \mathcal{N}\times \mathcal{N}\) is a set of edges. For any \(i,j\in \mathcal{N}\), if \((i,j)\in \mathcal{E}\), then \(w_{ij}>0\); otherwise \(w_{ij}=0\).
2 Problem statement and algorithm design
2.1 Problem statement
Consider a set of agents \(\mathcal{N}=\{1,2,\ldots ,N\}\) over an undirected network \(\mathcal{G}=\{\mathcal{N},\mathcal{E}\}\), where \(\mathcal{E}\subseteq \mathcal{N}\times \mathcal{N}\) is a set of edges. These agents aim to collaborate to find an optimal solution \(x^{*}\) of the problem

\(\min_{x\in \mathcal{X}}\ h(x):=\frac{1}{N}\sum_{i=1}^{N}H_{i}(x)\),  (1)
where \(x\in \mathbb{R}^{n}\) is the strategy variable, and \(\mathcal{X}\subseteq \mathbb{R}^{n}\) is a compact and convex set. The function \(H_{i}(x):=\mathbb{E}_{\xi ^{i}}[h_{i}(x,\xi ^{i})]\) is a local objective function, and \(h_{i}:\mathcal{X}\times \mathbb{R}^{p}\rightarrow \mathbb{R}\) involves a random variable \(\xi ^{i}\) with an unknown distribution. The randomness \(\xi ^{i}\) can be viewed as a random sample injected by the algorithm or as measurement noise inherent in the system. Here, we assume that the gradient of the objective function \(H_{i}(\cdot )\) is expensive or infeasible to obtain, and agent \(i\in \mathcal{N}\) can only access a stochastic approximation \(h_{i}(x,\xi ^{i})\) of the true objective value for any given x and \(\xi ^{i}\).
2.2 Algorithm design
We propose a Distributed Stochastic Zeroth-Order Frank-Wolfe algorithm (DSZO-FW), summarized in Algorithm 1. To measure its convergence performance, we introduce the following two oracles and a performance measure.
-
Stochastic Zeroth-order Oracle (SZO): SZO returns a function value \(h_{i}(x,\xi ^{i})\) for given \(x\in \mathbb{R}^{n}\) and \(\xi ^{i}\in \mathbb{R}^{p}\).
-
Linear Minimization Oracle (LMO): LMO solves a linear optimization problem, and returns \(\operatorname{argmin}_{\phi \in \mathcal{X}}\langle s,\phi \rangle \) for given direction s and constraint set \(\mathcal{X}\).
-
ϵ-optimal solution: Let \(x^{*}\in \mathcal{X}\) be an optimal solution of problem (1). If \(h(x)-h(x^{*})\leqslant \epsilon \), then \(x\in \mathcal {X}\) is an ϵ-optimal solution of problem (1).
Due to the unavailability of gradient information for the objective functions, agent i estimates the gradient \(\nabla h_{i}(x^{i},\xi ^{i})\) by using a coordinate-wise gradient estimator [3, 14]:

\(\hat{\nabla}h_{i}(x^{i},\xi ^{i}):=\sum^{n}_{j=1}\frac{h_{i}(x^{i}+\rho e_{j},\xi ^{i})-h_{i}(x^{i}-\rho e_{j},\xi ^{i})}{2\rho}e_{j}\),  (8)
where \(\rho >0\) denotes the element-wise smoothing parameter, and \(e_{j}\in \mathbb{R}^{n}\) is the standard basis vector whose jth entry is 1 and whose other entries are 0. At iteration k in Algorithm 1, the estimator (8) takes the form

\(\hat{\nabla}h_{i}(x^{i}_{k},\xi ^{i}_{k}):=\sum^{n}_{j=1}\frac{h_{i}(x^{i}_{k}+\rho _{k}e_{j},\xi ^{i}_{k})-h_{i}(x^{i}_{k}-\rho _{k}e_{j},\xi ^{i}_{k})}{2\rho _{k}}e_{j}\),  (9)
where \(\{\rho _{k}\}_{k=1}^{\infty}\) is a decreasing sequence of positive real numbers.
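For concreteness, the following minimal NumPy sketch implements the estimator (9); the toy quadratic objective, the function names, and the way randomness is passed are our own illustrative choices rather than part of Algorithm 1.

```python
import numpy as np

def coordinate_gradient_estimate(h, x, rho, xi):
    """Coordinate-wise zeroth-order estimate (9): 2n oracle queries,
    reusing the same sample xi for the +rho and -rho evaluations."""
    n = x.size
    g = np.zeros(n)
    for j in range(n):
        e = np.zeros(n)
        e[j] = 1.0
        g[j] = (h(x + rho * e, xi) - h(x - rho * e, xi)) / (2.0 * rho)
    return g

# Toy stochastic oracle: h(x, xi) = 0.5*||x||^2 + 0.1*xi (additive noise).
h = lambda x, xi: 0.5 * float(x @ x) + 0.1 * xi
x = np.array([1.0, -2.0, 0.5])
xi = np.random.default_rng(0).standard_normal()
print(coordinate_gradient_estimate(h, x, rho=1e-3, xi=xi))  # approx. x
```

With additive noise, the shared sample cancels in the central difference, which is why reusing \(\xi ^{i}_{k}\) across the 2n queries matters.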
In Algorithm 1, each agent uses SZO rather than the gradient information and mainly executes four steps. Here, we briefly introduce the process of the ith agent’s kth iteration.
-
Step 1: Agent i takes a weighted average of values from its neighbors on the basis of W, and uses \(\bar{x}^{i}_{k}\) to approximate the average iterate. The specific description is provided in (2).
-
Step 2: Agent i estimates the gradient by using the coordinate-wise gradient estimator (9). To address the non-vanishing variance caused by the gradient estimation, the paper introduces a modified momentum-based variance reduction method, aka recursive momentum [9], into the distributed stochastic Frank-Wolfe (FW) algorithm. The specific expression is described in (3).
-
Step 3: Agent i approximates the global gradient by using the gradient tracking technique, which reuses the global gradient estimation \(y^{i}_{k-1}\) from the previous iteration via (4) and (5).
-
Step 4: To avoid projection operations, agent i updates the iterate by first solving the linear minimization problem (6) to obtain a conditional gradient \(z^{i}_{k}\), and then taking a convex combination with the average iterate approximation \(\bar{x}^{i}_{k}\) in (7).
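To make the four steps concrete, the sketch below assembles one synchronous iteration for all agents. It is our schematic reading of (2)–(7): the exact ordering of the tracking updates (4)–(5), the initialization, and the helper signatures est and lmo are illustrative assumptions, not a verbatim transcription of Algorithm 1.

```python
import numpy as np

def dszo_fw_step(k, x, xbar_prev, g_prev, s_prev, W, est, lmo):
    """One synchronous DSZO-FW iteration (sketch of Steps 1-4).

    x, xbar_prev, g_prev, s_prev: (N, n) arrays of local variables;
    est(i, point, rho, xi): coordinate-wise estimate (9) for agent i;
    lmo(s): solves min_{z in X} <s, z> over the constraint set X.
    """
    n = x.shape[1]
    beta, gamma = 2.0 / (k + 1), 2.0 / (k + 2)
    rho = 1.0 / (np.sqrt(n) * (k + 2))        # one valid choice of rho_k
    xbar = W @ x                              # Step 1: consensus mixing (2)
    g = np.empty_like(x)
    rng = np.random.default_rng(k)
    for i in range(x.shape[0]):
        xi = rng.integers(10**9)              # one fresh sample per agent
        # Step 2: recursive momentum (3); both estimates reuse sample xi
        g[i] = est(i, xbar[i], rho, xi) + (1.0 - beta) * (
            g_prev[i] - est(i, xbar_prev[i], rho, xi))
    y = s_prev + g - g_prev                   # Step 3: gradient tracking (4)
    s = W @ y                                 #   mix tracked directions (5)
    z = np.vstack([lmo(s[i]) for i in range(x.shape[0])])
    x_next = (1.0 - gamma) * xbar + gamma * z # Step 4: FW update (6)-(7)
    return x_next, xbar, g, s
```

Starting from \(g_{0}=s_{0}=0\), the factor \(1-\beta _{1}=0\) makes the first momentum update collapse to a plain estimate, consistent with the initialization \(g^{i}_{1}=\hat{\nabla}h_{i}(\bar{x}^{i}_{1},\xi ^{i}_{1})\) that appears in the proofs.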
Remark 1
The employment of zeroth-order gradients, also known as derivative-free optimization, brings both unique challenges and potential advantages. One main challenge of zeroth-order methods is their heavy demand for function evaluations compared with first-order methods, which leads to high gradient variance and computational cost. To address this issue, this paper incorporates the recursive momentum technique into a gradient-tracking distributed framework to reduce the non-vanishing variance caused by gradient estimation. Remarkably, the proposed distributed zeroth-order algorithm not only attenuates the noise in the gradient approximation using only a single batch, but also achieves a function query complexity comparable to the best existing centralized result in the convex case. The most significant advantage of zeroth-order gradients is the ability to optimize functions without gradient information, which makes the method applicable to a wide range of problems where gradients are difficult or impossible to compute.
Remark 2
In Algorithm 1, we introduce the recursive momentum technique into the distributed zeroth-order FW method to reduce the variance caused by gradient estimates, as described in (3). Specifically, we rewrite (3) as

\(g^{i}_{k}=\beta _{k}\hat{\nabla}h_{i}(\bar{x}^{i}_{k},\xi ^{i}_{k})+(1-\beta _{k})(\hat{\nabla}h_{i}(\bar{x}^{i}_{k},\xi ^{i}_{k})-\hat{\nabla}h_{i}(\bar{x}^{i}_{k-1},\xi ^{i}_{k})+g^{i}_{k-1})\).
The second term \(\hat{\nabla}h_{i}(\bar{x}^{i}_{k},\xi ^{i}_{k})-\hat{\nabla}h_{i}(\bar{x}^{i}_{k-1},\xi ^{i}_{k})+g^{i}_{k-1}\) plays an important role in reducing the variance caused by gradient estimation. In addition, the recursive momentum technique allows Algorithm 1 to converge with only one sample at each iteration, unlike the algorithms in [4] and [22], which require large batches. Hence, Algorithm 1 is also well suited to large-scale finite-sum optimization problems.
Remark 3
In Algorithm 1, the FW step ((6)–(7)) circumvents the projection operation by solving the linear optimization subproblem (6) over the constraint set \(\mathcal{X}\). When the constraint set is structured, such as a nuclear or \(l_{1}\) norm ball, (6) admits an efficient implementation or even a closed-form solution [15], which is computationally cheaper than the projection step. For example, if \(\mathcal{X}\) is an \(l_{1}\) norm ball (\(\mathcal{X}:=\{x|\|x\|_{1} \leqslant d\}\)), the FW step has the closed-form solution \(z^{i}_{k}=d\cdot [0,\ldots ,0,-\operatorname{sgn}([s^{i}_{k}]_{h}),0,\ldots ,0]^{\mathrm{T}}\) with \(h=\operatorname{argmax}_{j}|[s^{i}_{k}]_{j}|\). Moreover, when \(\mathcal{X}\) is a nuclear norm ball, solving (6) requires computing only a single pair of singular vectors corresponding to the largest singular value, whereas computing a projection onto \(\mathcal{X}\) demands a full SVD.
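A short sketch (ours, not from the paper) of this closed-form solution over the \(l_{1}\) ball:

```python
import numpy as np

def lmo_l1_ball(s, d):
    """Solve min_{||z||_1 <= d} <s, z>: place all mass d on the coordinate
    h with the largest |s_h|, with sign opposite to s_h."""
    h = int(np.argmax(np.abs(s)))
    z = np.zeros_like(s)
    z[h] = -d * np.sign(s[h])
    return z

s = np.array([0.3, -1.7, 0.9])
print(lmo_l1_ball(s, d=5.0))   # [0. 5. 0.], optimal value -d*||s||_inf = -8.5
```

The optimal value is \(-d\|s\|_{\infty}\), obtained with a single pass over the coordinates, in contrast to the sorting or root-finding required by an \(l_{1}\)-ball projection.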
3 Assumptions and convergence analysis
This section analyzes the convergence performance of Algorithm 1. Before presenting the main results, we state several standard assumptions.
3.1 Assumptions and facts
Assumption 1
The network \(\mathcal{G}\) is connected.
Assumption 2
The weighted adjacency matrix W is doubly stochastic.
Assumptions 1 and 2 indicate that, in each round of Step 1 in Algorithm 1, an agent takes a weighted average of the values from its neighbors according to W. In addition, these assumptions [26] imply that the second largest eigenvalue λ of W (in magnitude) satisfies \(|\lambda |<1\). The following fact holds under Assumptions 1 and 2 [26].
Fact 1
Let \(\bar{x}=\frac{1}{N} \sum_{i=1}^{N} x^{i}\) and \(\bar{x}^{i}=\sum_{j=1}^{N} w_{ij}x^{j}\). Then, \((\sum_{i=1}^{N} \|\bar{x}^{i} - \bar{x}\|^{2} )^{ \frac{1}{2}}\leqslant |\lambda | ( \sum_{i=1}^{N} \|x^{i} - \bar{x}\|^{2} )^{\frac{1}{2}}\).
Fact 1 suggests that each update of the average consensus process (Step 1) incrementally aligns the iteration variables with their mean value x̄. To streamline the convergence analysis, we introduce \(k_{0}\in \mathbb{R}_{+}\) as the smallest integer such that \(|\lambda |\leqslant [k_{0}/(k_{0}+1)]^{2}\); clearly, \(k_{0}=\lceil (|\lambda |^{-\frac{1}{2}}-1)^{-1}\rceil \).
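For example, with a toy spectral value of our own choosing, \(k_{0}\) can be computed directly from λ:

```python
import math

lam = 0.64                                  # illustrative |lambda| < 1
k0 = math.ceil(1.0 / (lam ** -0.5 - 1.0))   # k0 = ceil((|lambda|^{-1/2} - 1)^{-1})
assert lam <= (k0 / (k0 + 1)) ** 2          # the defining property of k0
print(k0)                                   # 4, since 0.64 = (4/5)^2
```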
Assumption 3
\(H_{i}(\cdot )\) and \(h_{i}(\cdot ,\xi ^{i})\) are L-smooth functions on the constraint set \(\mathcal{X}\) for all \(i\in \mathcal{N}\) and \(\xi ^{i}\in \mathbb{R}^{p}\).
Furthermore, we posit an additional assumption on the constraint set \(\mathcal{X}\), which is foundational for FW-based methods [3, 4, 14, 22].
Assumption 4
\(\mathcal{X}\) is compact and convex with diameter at most d; that is, \(\|x-y\|\leqslant d\) for all \(x,y\in \mathcal{X}\), where d is a positive constant.
Assumption 5
The variance of \(\nabla h_{i}(x,\xi ^{i})\) is bounded for all \(x\in \mathcal{X}\) and \(i\in \mathcal{N}\). That is, there exists a constant δ such that \(\mathbb{E}[\|\nabla h_{i}(x,\xi ^{i})-\nabla H_{i}(x)\|^{2}] \leqslant \delta ^{2}\), where \(H_{i}(x)=\mathbb{E}[h_{i}(x,\xi ^{i})]\).
Fact 2
(see [13])
If Assumptions 4–5 hold, there is a positive constant l such that \(\mathbb{E}[\|\nabla h_{i}(x,\xi ^{i})\|^{2}]\leqslant l^{2}\) and \(\mathbb{E}[\|\nabla h_{i}(x, \xi ^{i})\|]\leqslant l\).
Assumptions 3–5 are standard assumptions in stochastic FW methods [3, 4, 9, 13, 14, 22]. If Assumption 3 holds, the following fact is true.
Fact 3
Define \(\hat{\nabla}H_{i}(x^{i}):=\sum^{n}_{j=1} \frac{H_{i}(x^{i}+\rho e_{j})-H_{i}(x^{i}-\rho e_{j})}{2\rho}e_{j}= \mathbb{E}[{\hat{\nabla}h_{i}(x^{i},\xi ^{i})}]\), where \(\hat{\nabla}h_{i}(x^{i},\xi ^{i})\) is defined in (8). Then, for any \(x^{i}\in \mathcal{X}\) (\(i\in \mathcal{N}\)) and \(\xi ^{i}\in \mathbb{R}^{p}\),

\(\|\hat{\nabla}h_{i}(x^{i},\xi ^{i})-\nabla h_{i}(x^{i},\xi ^{i})\|\leqslant \sqrt{n}L\rho \),  (11)

\(\|\hat{\nabla}H_{i}(x^{i})-\nabla H_{i}(x^{i})\|\leqslant \sqrt{n}L\rho \).  (12)
Proof
We first prove (11). It follows from the definition of \(\hat{\nabla}h_{i}(x^{i},\xi ^{i})\) and the mean value theorem that there exists \(\alpha _{j}\in (0,1)\) such that
It follows from the property of the basis vector \(e_{j}\) and Euclidean norm that
where we use Assumption 3 in the second inequality. We obtain Eqn. (12) in a similar way. □
Fact 4
(see [13])
For any vectors \(v_{1},\ldots , v_{N}\in \mathbb{R}^{n}\),

\(\|\sum_{i=1}^{N}v_{i}\|^{2}\leqslant N\sum_{i=1}^{N}\|v_{i}\|^{2}\).  (13)
Assumptions 1–5 and Facts 2–4 are crucial to the subsequent analysis. They serve as the theoretical groundwork upon which our analysis is constructed, ensuring a rigorous foundation for the methodologies employed and the conclusions drawn.
3.2 Convergence analysis
For the convenience of analysis, we define

\(\bar{g}_{k}:=\frac{1}{N}\sum_{i=1}^{N}g^{i}_{k}\), \(\bar{p}_{k}:=\frac{1}{N}\sum_{i=1}^{N}\nabla H_{i}(\bar{x}^{i}_{k})\).
The following lemma estimates the tracking error for the average iterate in Algorithm 1, and we provide the proof in Appendix 1.2.
Lemma 1
Let \(\gamma _{k}=\frac{2}{k+2}\). If Assumptions 1, 2 and 4 hold, then, for any \(i\in \mathcal{N}\) and \(k\geqslant 1\), \(\|\bar{x}^{i}_{k}-\bar{x}_{k}\|\leqslant \frac{2C_{1}}{k+2}\) and \(\|\bar{x}^{i}_{k+1}-\bar{x}^{i}_{k}\|\leqslant \frac{2(d+2C_{1})}{k+2}\), where \(C_{1}\) is defined in Table 2.
Lemma 1 shows that the average iterate estimate \(\bar{x}^{i}_{k}\) approaches the true average \(\bar{x}_{k}\) at a rate of \(\mathcal{O}(1/k)\).
The following lemma characterizes the performance of the average gradient tracking in Algorithm 1; Appendix 1.4 presents the proof.
Lemma 2
Suppose Assumptions 1–5 hold. If \(\beta _{k}=\frac{2}{k+1}\), \(\gamma _{k}=\frac{2}{k+2}\) and \(0<\rho _{k}\leqslant \frac{d+2C_{1}}{\sqrt{n}(k+2)}\), then

\(\mathbb{E}[\|\bar{g}_{k}-s^{i}_{k}\|^{2}]\leqslant \frac{C_{2}}{(k+2)^{2}}\),  (14)
where \(C_{2}\) is defined in Table 2 and \(k\geqslant 1\).
Lemma 2 establishes that \(\mathbb{E}[\|\bar{g}_{k}-s^{i}_{k}\|^{2}]=\mathcal{O}(1/k^{2})\), which implies that \(\|\bar{g}_{k}-s^{i}_{k}\|\) converges to zero as \(k\rightarrow +\infty \) in expectation.
The following lemma plays an important role in the convergence analysis of Algorithm 1.
Lemma 3
Define \(\hat{\nabla}\bar{h}_{k}:=\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}_{k}[ \hat{\nabla}h_{i}(\bar{x}^{i}_{k},\xi ^{i}_{k})]\). If Assumptions 1–5 hold, the following two relations hold.
1) For any \(k\geqslant 1\), it holds that
2) If \(\beta _{k}=\frac{2}{k+1}\), \(\gamma _{k}=\frac{2}{k+2}\), and \(\rho _{k}\leqslant \frac{d+2C_{1}}{\sqrt{n}(k+2)}\), then for any \(k\geqslant 1\),

\(\mathbb{E}[\|\bar{g}_{k}-\bar{p}_{k}\|^{2}]\leqslant \frac{2C_{3}+2L^{2}(d+2C_{1})^{2}}{k+2}\),  (16)
where \(C_{3}\) and \(C_{1}\) are defined in Table 2.
The proof of Lemma 3 is provided in Appendix 1.5.
Lemma 3 shows that the variable \(\bar{g}_{k}\) tracks the true average gradient \(\bar{p}_{k}\) with an error bounded by \(\mathcal{O}(\frac{C_{3}+L^{2}(d+2C_{1})^{2}}{k+2})\). That is, the expected error of the stochastic gradient approximation diminishes as the number of iterations increases. Making use of Lemmas 2 and 3, the following lemma is established.
Lemma 4
Choose \(\beta _{k}=\frac{2}{k+1}\), \(\gamma _{k}=\frac{2}{k+2}\), and \(0<\rho _{k}\leqslant \frac{d+2C_{1}}{\sqrt{n}(k+2)}\). If Assumptions 1–5 hold, then, for any \(k\geqslant 1\) and \(i\in \mathcal{N}\),
The proof is presented in Appendix 1.6.
The following two theorems establish convergence rates of Algorithm 1 for convex and nonconvex objectives, respectively.
Theorem 1
(Convex objective) Let Assumptions 1–5 hold. Choose \(\beta _{k}=\frac{2}{k+1}\), \(\gamma _{k}=\frac{2}{k+2}\), and \(0<\rho _{k}\leqslant \frac{d+2C_{1}}{\sqrt{n}(k+2)}\). If \(h_{i}(\cdot ,\xi ^{i})\) is convex for any \(i\in \mathcal{N}\) and \(\xi ^{i}\), then

\(\mathbb{E}[h(\bar{x}_{k})-h(x^{*})]\leqslant \frac{C_{4}}{(k+2)^{\frac{1}{2}}}\),
where \(C_{4}\) is defined in Table 2.
The proof of Theorem 1 is presented in Appendix 1.7.
Theorem 1 indicates that the convergence rate of Algorithm 1 is \(\mathcal{O}(1/k^{\frac{1}{2}})\). The result translates directly into finding an ϵ-optimal solution to problem (1): the numbers of calls to SZO and LMO are \(\mathcal{O}(\frac{nC^{2}_{4}}{\epsilon ^{2}})\) and \(\mathcal{O}(\frac{C^{2}_{4}}{\epsilon ^{2}})\), respectively.
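As a quick sanity check on these counts, with illustrative constants of our own choosing and using the fact that each estimate (9) costs 2n SZO queries:

```python
import math

C4, n, eps = 10.0, 100, 1e-2
K = math.ceil((C4 / eps) ** 2)     # iterations so that C4/(K+2)^{1/2} <= eps
lmo_calls = K                      # one LMO call (6) per iteration
szo_calls = 2 * n * K              # 2n function queries per iteration
print(K, lmo_calls, szo_calls)     # 1000000 1000000 200000000
```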
For the nonconvex case, we introduce a convergence criterion commonly used for FW methods, the FW-gap [4, 13, 14, 22], defined as

\(p_{k}:=\max_{v\in \mathcal{X}}\langle \nabla h(\bar{x}_{k}),\bar{x}_{k}-v\rangle \).  (18)
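Over the \(l_{1}\) ball used in Sect. 4, the maximization in (18) has the closed form \(p_{k}=\langle \nabla h(\bar{x}_{k}),\bar{x}_{k}\rangle +d\|\nabla h(\bar{x}_{k})\|_{\infty}\); a small sketch of ours:

```python
import numpy as np

def fw_gap_l1(grad, x, d):
    """FW-gap max_{v: ||v||_1 <= d} <grad, x - v> over the l1 ball,
    which equals <grad, x> + d * ||grad||_inf."""
    return float(grad @ x) + d * float(np.max(np.abs(grad)))

grad = np.array([0.3, -1.7, 0.9])
x = np.array([1.0, 0.0, -1.0])
print(fw_gap_l1(grad, x, d=5.0))   # -0.6 + 5*1.7 = 7.9
```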
Based on the convergence measure (18), we establish the following theorem for problem (1) with nonconvex objective functions.
Theorem 2
(Nonconvex objective) Suppose Assumptions 1–5 hold. Choose \(\beta _{k}=\frac{2}{k+1}\), \(\gamma _{k}=\frac{2}{k+2}\), and \(0<\rho _{k}\leqslant \frac{d+2C_{1}}{\sqrt{n}(k+2)}\). Then, for \(K=2^{m}\),

\(\min_{1\leqslant k\leqslant K}\mathbb{E}[p_{k}]\leqslant \frac{\Gamma}{\log _{2}(K)-1}\),

where Γ is a constant determined by \(h(\bar{x}_{1})-h(x^{*})\), c, and the constants in Table 2, and \(c\in \mathbb{R}\) satisfies \(\sum_{k=1}^{2^{m}}(4d/(k+2)^{\frac{3}{2}})\leqslant c\).
The proof of Theorem 2 is presented in Appendix 1.8.
Theorem 2 shows that Algorithm 1 converges to a stationary point at a rate of \(\mathcal{O}(1/\log _{2}(K))\) when the objective function is nonconvex. The total numbers of calls to SZO and LMO for finding an ϵ-stationary point are \(\mathcal{O}(n\,2^{\frac{\Gamma}{\epsilon}})\) and \(\mathcal{O}(2^{\frac{\Gamma}{\epsilon}})\), respectively.
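Paralleling the convex case, a quick count under the bound stated above (with illustrative constants of our own choosing):

```python
import math

Gamma, eps, n = 2.0, 0.5, 100          # illustrative constants
K = 2 ** math.ceil(Gamma / eps + 1)    # ensures Gamma/(log2(K) - 1) <= eps
lmo_calls = K                          # one LMO call per iteration
szo_calls = 2 * n * K                  # 2n queries per estimate (9)
print(K, lmo_calls, szo_calls)         # 32 32 6400
```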
Remark 4
Table 1 shows that both the number of SZO calls and the required batch size per query of Algorithm 1 are significantly smaller than those of ZSCG and ZSAGMIU [4], at the cost of a larger complexity bound on LMO. In addition, Algorithm 1 has the same complexity bounds on SZO and LMO as the recently proposed centralized method MOST-FW [3]. Compared with the existing decentralized zeroth-order FW method DSGFF [22], which requires a central coordinator, the fully distributed Algorithm 1 has a lower SZO complexity bound in the convex case and a weaker dimensional dependence of SZO in the nonconvex case.
Remark 5
It is worth noting that the step sizes we use are monotonically decreasing, unlike those of the existing zeroth-order nonconvex FW methods [4, 14, 22], whose step sizes depend on the total iteration number K and the dimension of the variable.
4 Numerical simulations
In this section, we apply Algorithm 1 (DSZO-FW) to black-box distributed stochastic binary classification problems with convex and nonconvex objectives, respectively. DSZO-FW is run over a connected network \(\mathcal{G}\) with \(N=5\) agents and a doubly stochastic adjacency matrix W. The communication graph is a ring, and each agent accesses only its own objective function \(h_{i}\). We construct W using maximum-degree weights: the maximum degree of the ring is \(d_{\max}=2\), so for any edge \((i,j)\) the weight is \(w_{ij}=1/(1+d_{\max})\) for \(i\neq j\), and the diagonal elements are set to \(w_{ii}=1-\sum_{j\in \mathcal{N}_{i}}w_{ij}\), where \(\mathcal{N}_{i}\) denotes the set of neighbors of node i, so that each row sums to 1. We set the constraint set to the \(l_{1}\)-norm ball \(\mathcal{X}=\{x|\|x\|_{1}\leqslant d\}\) with \(d=5\).
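A minimal sketch (ours) of this construction, which also checks double stochasticity and the mixing eigenvalue λ used in the analysis:

```python
import numpy as np

def ring_max_degree_weights(N):
    """Doubly stochastic mixing matrix for a ring of N agents with
    maximum-degree weights: w_ij = 1/(1 + d_max) on edges, d_max = 2."""
    W = np.zeros((N, N))
    w = 1.0 / 3.0                        # 1/(1 + d_max) with d_max = 2
    for i in range(N):
        W[i, (i - 1) % N] = w            # left neighbor on the ring
        W[i, (i + 1) % N] = w            # right neighbor on the ring
        W[i, i] = 1.0 - 2.0 * w          # rows and columns sum to 1
    return W

W = ring_max_degree_weights(5)
assert np.allclose(W.sum(axis=0), 1.0) and np.allclose(W.sum(axis=1), 1.0)
lam = sorted(np.abs(np.linalg.eigvals(W)))[-2]
print(round(float(lam), 3))              # second largest |eigenvalue|, < 1
```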
To better evaluate the performance of DSZO-FW, we compare it against the centralized algorithms ZSCG [4], SGFFW [23], and MOST-FW [3] as baselines. In the experiments, we use three public datasets (see footnote 2): covtype.binary, a9a and w8a, and each iteration randomly draws only 1% of the data. Because the large batch size \(m_{k}\) required by ZSCG (which depends on the dimension and the total number of iterations) exceeds the total number of samples in these three datasets, we run ZSCG as a deterministic algorithm that uses the full data to compute function values. We evaluate the four algorithms by the FW-gap defined in (18).
4.1 Black-box binary classification with convex objectives
This subsection is dedicated to verifying the theoretical results of DSZO-FW in the convex case. Our goal is to find an optimal solution \(x\in \mathbb{R}^{n}\) of the following stochastic binary classification problem:
where \((a_{ij},b_{ij})^{m_{i}}_{j=1}\) are \(m_{i}\) (feature, label) pairs randomly drawn by agent i from the dataset. For the benchmark, we set the step sizes of the four algorithms to the values suggested by their theoretical results in the convex setting, i.e., \(\alpha _{k}=6/(k+5)\) for ZSCG [4]; \(\rho _{k}=4/(k+8)^{\frac{2}{3}}\), \(\gamma _{k}=2/(k+8)\) and \(c_{k}=2/(n^{\frac{1}{2}}(k+8)^{\frac{1}{3}})\) for SGFFW [23]; \(\gamma _{k}=1/k\), \(\eta _{k}=2/(k+1)\), \(\mu _{k}=0\) and \(\rho _{k}=d/(\sqrt{n}(k+1))\) for MOST-FW [3]; \(\beta _{k}=2/(k+1)\), \(\gamma _{k}=2/(k+2)\) and \(\rho _{k}=d/(\sqrt{n}(k+2))\) for DSZO-FW.
Figure 1 shows the convergence performance of the four algorithms on the convex binary classification problem. We observe that DSZO-FW and MOST-FW achieve a smaller FW-gap than ZSCG and SGFFW, especially on dataset \(w8a\), even though they use less data than ZSCG. This indicates that the local gradient estimate via the recursive momentum technique is a better candidate for approximating the gradient. We also observe periodic oscillations in the curves of the four algorithms, especially on datasets \(a9a\) and \(w8a\). We believe this phenomenon stems from the imprecision of the gradient estimator together with the periodic behavior of the variance reduction technique.
4.2 Black-box binary classification with nonconvex objectives
In this subsection, we verify the theoretical results of DSZO-FW in the nonconvex case. Consider the following stochastic binary classification problem with nonconvex objective functions:
where \((a_{ij},b_{ij})^{m_{i}}_{j=1}\) are \(m_{i}\) (feature, label) pairs randomly drawn by agent i from the dataset. For the benchmark, we set the step sizes of the four algorithms to the values suggested by their theoretical results in the nonconvex setting, i.e., \(\alpha _{k}=1/T^{\frac{1}{2}}\) for ZSCG [4]; \(\gamma _{k}=1/T^{\frac{3}{4}}\), \(\rho _{k}=4/((k+8)^{\frac{2}{3}}(1+n)^{\frac{1}{3}})\), and \(c_{k}=2/(n^{\frac{3}{2}}(k+8)^{\frac{1}{3}})\) for SGFFW [23]; \(\gamma _{k}=1/k\), \(\eta _{k}=2/(k+1)\), \(\mu _{k}=0\), and \(\rho _{k}=d/(\sqrt{n}(k+1))\) for MOST-FW [3]; \(\beta _{k}=2/(k+1)\), \(\rho _{k}=d/(\sqrt{n}(k+2))\), and \(\gamma _{k}=2/(k+2)\) for DSZO-FW. Note that MOST-FW has no convergence guarantee in the nonconvex case; we implement it only for comparison.
Figure 2 shows the convergence performance, measured by the FW-gap, of the four algorithms on the nonconvex binary classification problem. The results show that DSZO-FW converges faster than ZSCG and SGFFW on all three datasets, and has convergence performance comparable to MOST-FW on datasets \(a9a\) and \(w8a\), demonstrating the efficacy of the variance reduction technique used in DSZO-FW and MOST-FW. As in Fig. 1, periodic oscillations also appear in the curves of the four algorithms, especially on datasets \(a9a\) and \(w8a\). We infer that this phenomenon occurs because the variance of the gradient estimator is high in these two cases.
5 Conclusions
This paper proposed a novel projection-free and gradient-free algorithm for distributed stochastic optimization problems with access only to a stochastic zeroth-order oracle (SZO). The proposed algorithm requires only a single batch per iteration to guarantee convergence, thanks to the recursive momentum and gradient tracking techniques. We proved that the proposed algorithm attains the complexity bound \(\mathcal{O}(n/\epsilon ^{2})\) on SZO, matching the best centralized results for the convex case. For the nonconvex case, the algorithm has a complexity bound of \(\mathcal{O}(n\,2^{\frac{1}{\epsilon}})\) on SZO under mild conditions. The efficacy of the proposed algorithm was demonstrated through simulation experiments on multiple datasets. Future work includes extending the algorithm to stochastic nonsmooth optimization problems and introducing further variance reduction techniques to obtain better convergence performance.
Data availability
Not applicable.
Notes
The following results are normalized to finding an ϵ-optimal solution for convex optimization problems and an ϵ-stationary point for nonconvex optimization problems. The symbol n denotes the dimension of the strategy variable.
Available at https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.
References
S. Aeron, V. Saligrama, D.A. Castanon, Efficient sensor management policies for distributed target tracking in multihop sensor networks. IEEE Trans. Signal Process. 56(6), 2562–2574 (2008)
Z. Akhtar, K. Rajawat, Momentum based projection free stochastic optimization under affine constraints, in American Control Conf. (2021), pp. 2619–2624
Z. Akhtar, K. Rajawat, Zeroth and first order stochastic Frank-Wolfe algorithms for constrained optimization. IEEE Trans. Signal Process. 70, 2119–2135 (2022)
K. Balasubramanian, S. Ghadimi, Zeroth-order (non)-convex stochastic optimization via conditional gradient and gradient updates, in Proc. Int. Conf. Neural Inf. Process. Syst. (2018), pp. 3459–3468
A. Bellet, Y. Liang, A.B. Garakani et al., A distributed Frank-Wolfe algorithm for communication-efficient sparse learning, in Proc. SIAM Int. Conf. Data Mining (2015), pp. 478–486. https://doi.org/10.1137/1.9781611974010.54
G. Chen, P. Yi, Y. Hong et al., Distributed optimization with projection-free dynamics: a Frank-Wolfe perspective. IEEE Trans. Cybern. 54(1), 599–610 (2024). https://doi.org/10.1109/TCYB.2023.3284822
J. Chen, J. Sun, G. Wang, From unmanned systems to autonomous intelligent systems. Engineering 12(5), 16–19 (2022)
P. Chen, H. Zhang, Y. Sharma et al., ZOO: zeroth order optimization based black-box attacks to deep neural networks without training substitute models, in Proc. ACM. Work. Artif. Intell. Sec. (2017), pp. 15–26
A. Cutkosky, F. Orabona, Momentum-based variance reduction in non-convex SGD, in Proc. Adv. Neural Inf. Process. Syst. (2019), pp. 15210–15219
M. Frank, P. Wolfe, An algorithm for quadratic programming. Nav. Res. Logist. 3(1–2), 95–110 (1956)
K. Fu, H. Chen, W. Zhao, Distributed dynamic stochastic approximation algorithm over time-varying networks. Auton. Intell. Syst. 1(5) (2021). https://doi.org/10.1007/s43684-021-00003-1
E. Hazan, H. Luo, Variance-reduced and projection-free stochastic optimization, in Proc. Int. Conf. Mach. Learn (2016)
J. Hou, X. Zeng, G. Wang et al., Distributed momentum-based Frank-Wolfe algorithm for stochastic optimization. IEEE/CAA J. Autom. Sin. 10(3), 676–690 (2023)
F. Huang, S. Chen, Accelerated stochastic gradient-free and projection-free methods, in Proc. Int. Conf. Mach. Learn. (2020), pp. 4519–4530
M. Jaggi, Revisiting Frank-Wolfe: projection-free sparse convex optimization, in Proc. Int. Conf, Mach. Learn., Atlanta, GA, USA (2013), pp. 427–435
Y. Kuriki, T. Namerikawa, Consensus-based cooperative formation control with collision avoidance for a multi-UAV system, in American Control Conf. (2014), pp. 2077–2082
D. Li, N. Li, L. Lewis, Projection-free distributed optimization with nonconvex local objective functions and resource allocation constraint. IEEE Trans. Control Netw. Syst. 8(1), 413–422 (2021)
A. Mokhtari, H. Hassani, A. Karbasi, Stochastic conditional gradient methods: from convex minimization to submodular maximization. J. Mach. Learn. Res. 21(105), 1–49 (2020)
S. Pu, A. Olshevsky, I.C. Paschalidis, Asymptotic network independence in distributed stochastic optimization for machine learning: examining distributed and centralized stochastic gradient descent. IEEE Signal Process. Mag. 37(3), 114–122 (2020)
R. Rubinstein, D. Kroese, Simulation and the Monte Carlo Method, vol. 10 (Wiley, New York, 2016)
A. Sahu, D. Jakovetic, D. Bajovic et al., Distributed zeroth order optimization over random networks: a Kiefer-Wolfowitz stochastic approximation approach, in IEEE Conf. Decision Contr. (2018), pp. 4951–4958. https://doi.org/10.1109/CDC.2018.8619044
A. Sahu, S. Kar, Decentralized zeroth-order constrained stochastic optimization algorithms: Frank–Wolfe and variants with applications to black-box adversarial attacks. Proc. IEEE 108(11), 1890–1905 (2020)
A. Sahu, M. Zaheer, S. Kar, Towards gradient free and projection free stochastic optimization, in Proc. Int. Conf. Artif. Intell. Statis. (2019), pp. 3468–3477
T. Salimans, J. Ho, X. Chen et al., Evolution strategies as a scalable alternative to reinforcement learning (2017). arXiv preprint https://doi.org/10.48550/arXiv.1703.03864
P. Sun, Z. Guo, G. Wang et al., MARVEL: enabling controller load balancing in software-defined networks with multi-agent reinforcement learning. Comput. Netw. 177, 107230 (2020)
H. Wai, J. Lafond, A. Scaglione et al., Decentralized Frank-Wolfe algorithm for convex and nonconvex problems. IEEE Trans. Autom. Control 62(11), 5522–5537 (2017)
D. Wang, Z. Wang, Z. Wu, Distributed convex optimization for nonlinear multi-agent systems disturbed by a second-order stationary process over a digraph. Sci. China Inf. Sci. 65, 132201 (2022). https://doi.org/10.1007/s11432-020-3111-4
G. Wang, S. Lu, G.B. Giannakis et al., Decentralized TD tracking with linear function approximation and its finite-time analysis, in Proceedings of the 34th International Conference on Neural Information Processing Systems, vol. 1154 (2020), pp. 13762–13772
Z. Wang, J. Zhang, T. Chang et al., Distributed stochastic consensus optimization with momentum for nonconvex nonsmooth problems. IEEE Trans. Signal Process. 69, 4486–4501 (2021)
Y. Xu, H. Deng, W. Zhu, Synchronous distributed admm for consensus convex optimization problems with self-loops. Inf. Sci. 614, 185–205 (2022)
R. Yang, L. Liu, G. Feng, An overview of recent advances in distributed coordination of multi-agent systems. Unmanned Syst. 10(03), 307–325 (2022)
X. Yi, S. Zhang, T. Yang et al., Linear convergence of first- and zeroth-order primal–dual algorithms for distributed nonconvex optimization. IEEE Trans. Autom. Control 67(8), 4194–4201 (2022)
X. Yi, S. Zhang, T. Yang et al., Zeroth-order algorithms for stochastic distributed nonconvex optimization. Automatica 142, 110353 (2022)
Z. Yu, D.W. Ho, D. Yuan, Distributed randomized gradient-free mirror descent algorithm for constrained optimization. IEEE Trans. Autom. Control 67(2), 957–964 (2022)
D. Yuan, B. Zhang, D.W. Ho et al., Distributed online bandit optimization under random quantization. Automatica 146, 110590 (2022)
S. Zhang, C.P. Bailey, Accelerated zeroth-order algorithm for stochastic distributed non-convex optimization, in American Contr. Conf. (2022), pp. 4274–4279. https://doi.org/10.23919/ACC53348.2022.9867306
Acknowledgements
The authors would like to thank the anonymous reviewers and potential users for their valuable comments and suggestions.
Funding
This work was supported by the National Natural Science Foundation of China under Grant Nos. 62222303, 62073035, and 62088101.
Contributions
All authors contributed to the design and implementation of the research. Material preparation and analysis were completed by Jie Hou, Xianlin Zeng and Chen Chen. Jie Hou and Xianlin Zeng contributed to the problem formulation, discussion of ideas, mathematical derivation and proof of results. Chen Chen contributed to the problem formulation and discussion of ideas. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
1.1 Technical lemmas for Lemma 1
We first provide some technical lemmas before proving Lemma 1.
Lemma 5
(Lemma 2, [2]) Let \(\{\Pi _{k}\}\) be a sequence of real numbers such that
for some \(a_{1}\in [0,1]\) satisfying \(a_{1}\leqslant a_{2}\leqslant 2a_{1}\), \(A_{1}> 1\) and \(A_{2}\geqslant 0\). Then \(\Pi _{k}\) converges to zero at a rate of
where \(A=\max \{\Pi _{0}(t_{0}+1)^{a_{2}-a_{1}}, \frac{A_{2}}{A_{1}-1} \}\).
Lemma 6
For all \(k=1,2,\ldots ,K\), if Assumptions 1 and 2 hold, we have the following relations:
(a) \(\frac{1}{N}\sum_{i=1}^{N}y_{k+1}^{i}=\bar{g}_{k+1}\);
(b) \(\bar{x}_{k+1}=(1-\gamma _{k})\bar{x}_{k}+\gamma _{k}\bar{z}_{k}\), where \(\bar{z}_{k}=\frac{1}{N}\sum_{i=1}^{N}z_{k}^{i}\).
Proof
(a) It follows from (4) of Algorithm 1 that
where the fact that the matrix W is doubly stochastic is used in the second equality. Hence, \(\frac{1}{N}\sum_{i=1}^{N}y_{k+1}^{i}=\bar{g}_{k+1}\).
(b) According to the definitions of \(\bar{x}_{k}\) and \(x_{k}^{i}\),
where the fact that the matrix W is doubly stochastic is used in the first equality. The proof is completed. □
1.2 Proof of Lemma 1
Proof
In the first step, we prove that \(\|\bar{x}^{i}_{k}-\bar{x}_{k}\|\leqslant C_{1}\gamma _{k}\).
We derive \(\|\bar{x}_{k}^{i}-\bar{x}_{k}\|\leqslant \max_{i\in \mathcal{N}}\|\bar{x}_{k}^{i}-\bar{x}_{k}\|\leqslant ( \sum_{i=1}^{N}\|\bar{x}_{k}^{i}-\bar{x}_{k}\|^{2})^{\frac{1}{2}}\) from the properties of the Euclidean norm. Next, we prove the following inequality by induction on k:

\(( \sum_{i=1}^{N}\|\bar{x}_{k}^{i}-\bar{x}_{k}\|^{2})^{\frac{1}{2}}\leqslant C_{1}\gamma _{k}\).  (19)
It can be observed that (19) holds for \(k=1\) to \(k=k_{0}-2\).
We assume that (19) holds for some \(k\geqslant k_{0}-2\) in the induction step. It follows from Lemma 6 (b) and (7) that
where λ is the second largest eigenvalue of W, and we use Fact 1 in the last inequality. Next, we provide an upper bound on \(\sum_{i=1}^{N}\|(1-\gamma _{k})(\bar{x}_{k}^{i}-\bar{x}_{k})+ \gamma _{k}(z_{k}^{i}-\bar{z}_{k})\|^{2}\).
where (a) holds because of Assumption 4; (b) is due to \(1-\gamma _{k}\leqslant 1\); (c) follows from \(\sum_{i=1}^{N}\|\bar{x}_{k}^{i}-\bar{x}_{k}\|\leqslant \sqrt{N} \sqrt{\sum_{i=1}^{N}\|\bar{x}_{k}^{i}-\bar{x}_{k}\|^{2}}\) and the induction hypothesis (19). Substituting (21), \(|\lambda |\leqslant (\frac{k_{0}}{k_{0}+1} )^{2}\) and \(\gamma _{k}=\frac{2}{k+2}\) into (20), we have
where we use the monotonic increase of the function \(g(x)=x/(1+x)\) on \([0,\infty )\) in the second inequality. From (22) we obtain \((\sum_{i=1}^{N}\|\bar{x}_{k+1}^{i}-\bar{x}_{k+1}\|^{2})^{\frac{1}{2}}\leqslant C_{1}\gamma _{k+1}\); that is, (19) holds for iteration \(k+1\). Hence, \(\|\bar{x}_{k}^{i}-\bar{x}_{k}\|\leqslant 2C_{1}/(k+2)\) for all \(k\geqslant 1\).
Next, we prove that \(\|\bar{x}^{i}_{k+1}-\bar{x}^{i}_{k}\|\leqslant \frac{2(d+2C_{1})}{k+2}\). From the definition of \(\bar{x}_{k}^{i}\), we have
where (a) holds for (7); (b) is due to the triangle inequality; (c) follows from Assumption 4. □
1.3 Technical lemmas
The following two lemmas bound \(\mathbb{E}[\|g^{i}_{k}\|]\) and \(\mathbb{E}[\|g^{i}_{k}\|^{2}]\) in Algorithm 1.
Lemma 7
Choose \(\beta _{k}=\frac{2}{k+1}\), \(\gamma _{k}=\frac{2}{k+2}\), and \(\rho _{k}\leqslant \frac{d+2C_{1}}{\sqrt{n}(k+2)}\). If Assumptions 1–5 hold, then, for any \(k\geqslant 1\) and \(i\in \mathcal{N}\),
where \(\psi _{1}=\max \{\|g_{1}^{i}\|,2l+5L(d+2C_{1})\}\) and \(C_{1}=k_{0}\sqrt{N}d\).
Proof
Obviously, (23) holds for \(k=1\). Next, we discuss the case \(k>1\). It follows from the update (3) that
where we use Facts 2 and 3 in (a); (b) holds by Fact 3 and the smoothness of \(h_{i}\) in Assumption 3; (c) holds because of the fact that \(\rho _{k}\leqslant \rho _{k-1}\), \(\beta _{k}<1\), and Lemma 1. Taking \(\beta _{k}=\frac{2}{k+1}\), \(\gamma _{k}=\frac{2}{k+2}\), and \(\rho _{k}\leqslant \frac{d+2C_{1}}{\sqrt{n}(k+2)}\), we obtain
Using Lemma 5 with \(t_{0}=1\), \(a_{1}=a_{2}=1\), \(A_{1}=2\), and \(A_{2}=2nl+4L(d+2C_{1})\), we conclude that \(\mathbb{E}[\|g^{i}_{k}\|]\leqslant \psi _{1}=\max \{\|g_{1}^{i} \|,2l+5L(d+2C_{1})\}\). □
Lemma 8
Suppose Assumptions 1–5 hold. Choose \(\gamma _{k}=\frac{2}{k+2}\), \(\beta _{k}=\frac{2}{k+1}\), and \(\rho _{k}\leqslant \frac{d+2C_{1}}{\sqrt{n}(k+2)}\). Then, for any \(k\geqslant 1\) and \(i\in \mathcal{N}\),
where \(\psi _{2}=\max \{\|g^{i}_{1}\|^{2},10L\psi _{1}(d+2C_{1})+28L^{2}(d+2C_{1})^{2}+8l^{2}+4l \psi _{1}\}\).
Proof
Obviously, (24) holds for \(k=1\). Next, we consider the case when \(k>1\). It follows from the update (3) that
where we use the fact \(1-\beta _{k}\leqslant 1\) and the triangle inequality in the last inequality. Next, we concentrate on the term \(\|\hat{\nabla}h_{i}(\bar{x}^{i}_{k},\xi ^{i}_{k})-\hat{\nabla}h_{i}( \bar{x}^{i}_{k-1},\xi ^{i}_{k})\|\) on the RHS of (25). Introducing \(\nabla h_{i}(\bar{x}^{i}_{k},\xi ^{i}_{k})-\nabla h_{i}(\bar{x}^{i}_{k-1}, \xi ^{i}_{k})\), we have
where we use the triangle inequality, Fact 3, and Lemma 1 to obtain the result. Substituting (26) into (25), and taking the conditional expectation on \(\mathcal{F}_{k}\), we therefore have that
where the last inequality is due to (13), Fact 2, and Fact 3. Taking the full expectation on both sides of (27) and taking \(\beta _{k}=\frac{2}{k+1}\), \(\gamma _{k}=\frac{2}{k+2}\), \(\rho _{k}\leqslant \frac{d+2C_{1}}{\sqrt{n}(k+2)}\), it follows from (23) that
Using Lemma 5 with \(t_{0}=1\), \(a_{1}=a_{2}=1\), \(A_{1}=2\), and \(A_{2}=10L\psi _{1}(d+2C_{1})+28L^{2}(d+2C_{1})^{2}+8l^{2}+4l\psi _{1}\), we prove the result. □
1.4 Proof of Lemma 2
Proof
To obtain the result in (14), we prove that
by using induction on k. Firstly, we prove that (28) holds if \(1\leqslant k\leqslant k_{0}-2\). It follows from the updates (4) and (5) that \(s^{i}_{k}=y^{i}_{k+1}-g^{i}_{k+1}+g^{i}_{k}\). We have
where we use (13) in the first inequality and (24) in the last inequality. Next, we focus on the second term of the RHS of (29). It follows from the update (4) that
where (a) holds because of (13) and Jensen’s inequality; (b) follows from (24); (c) is due to the fact that \(\mathbb{E}[\|y^{j}_{1}\|^{2}]=\mathbb{E}[\|\hat{\nabla}h_{i}(\bar{x}^{i}_{1}, \xi ^{i}_{1})-\nabla h_{i}(\bar{x}^{i}_{1},\xi ^{i}_{1})+\nabla h_{i}( \bar{x}^{i}_{1},\xi ^{i}_{1})\|^{2}]\leqslant 2l^{2}+L^{2}(d+2C_{1})^{2}/9\). We therefore obtain
for \(k< k_{0}-2\). Obviously, (28) is true for \(k\in \{1,\ldots ,k_{0}-2\}\).
For the induction step, we assume that (28) is true for some \(k\geqslant k_{0}-2\). For convenience of analysis, we define \(\Delta g^{i}_{k+1}:=g^{i}_{k+1}-g^{i}_{k}\) and \(\Delta \hat{g}_{k+1}:=\bar{g}_{k+1}-\bar{g}_{k}\). According to the update (4), we have \(y^{i}_{k+1}=\Delta g^{i}_{k+1}+s^{i}_{k}\). Further, it follows from Fact 1 and Lemma 6 (a) that
Next, we focus on the RHS of (30). It follows from the definitions of \(\Delta g^{i}_{k+1}\) and \(\Delta \hat{g}_{k+1}\) that
Recall the definition of \(\Delta g^{i}_{k+1}\) and the update (3). We bound \(\mathbb{E}[\|\Delta g^{i}_{k+1}\|^{2}]\) as
where (a) follows from (13), (24), and Fact 2; (b) holds by (11), the smoothness of \(h_{i}\), and the fact that \(\beta _{k+1}<1\); (c) holds because of the fact that \(\rho _{k+1}\leqslant \rho _{k}\) and Lemma 1. Furthermore, we use the result in (32) to obtain
where we use the fact that \(1-\frac{1}{N}\leqslant 1\) and the choice of \(\rho _{k}\), \(\beta _{k}\), \(\gamma _{k}\) in the last inequality. Taking the full expectation on both sides of (31), and then substituting (33) into the result, we obtain
where (a) is due to Hölder’s inequality; (b) follows from the induction hypothesis. Hence, (30) becomes
where we use the relations \(|\lambda |\leqslant [k_{0}/(k_{0}+1)]^{2}\), \(\gamma _{k}=2/(k+2)\), and the monotonic increase of the function \(g(x) = x/(1+x)\) on \([0,+\infty )\). That is, (28) holds for \(k+1\). The required result is obtained. □
1.5 Proof of Lemma 3
Proof
1) According to the definition of \(\bar{g}_{k}\) and the update (3), we have
Introducing \((1-\beta _{k})\hat{\nabla}\bar{h}_{k-1}\) into the RHS of the above equality and rearranging, we arrive at
Taking the squared norm on both sides of (35) and then the conditional expectation on \(\mathcal{F}_{k}\), we obtain
where the last equality holds due to the fact that
and
Next, we focus on the last term of the RHS of (36) and bound it separately. For convenience, we define \(\mathbb{U}_{k}:=\beta _{k} (\frac{1}{N}\sum_{i=1}^{N}\hat{\nabla}h_{i}( \bar{x}^{i}_{k},\xi ^{i}_{k})-\hat{\nabla}\bar{h}_{k} )\) and \(\mathbb{V}_{k}:=\frac{1-\beta _{k}}{N}\sum_{i=1}^{N}(\hat{\nabla}h_{i}( \bar{x}^{i}_{k},\xi ^{i}_{k})-\hat{\nabla}h_{i}(\bar{x}^{i}_{k-1}, \xi ^{i}_{k}))\). Hence, the second term of the RHS of (36) can be written as \(\mathbb{E}_{k}[\|\mathbb{U}_{k}+\mathbb{V}_{k}-\mathbb{E}_{k}[ \mathbb{V}_{k}]\|^{2}]\) and bounded by
where we use (13) and the Jensen’s inequality.
Next, we derive bounds on \(\mathbb{E}_{k}[\|\mathbb{U}_{k}\|^{2}]\) and \(\mathbb{E}_{k}[\|\mathbb{V}_{k}\|^{2}]\). Obviously, \(\hat{\nabla}\bar{h}_{k}=\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}_{k}[ \hat{\nabla}h_{i}(\bar{x}^{i}_{k},\xi ^{i}_{k})]=\frac{1}{N} \sum_{i=1}^{N} \hat{\nabla}H_{i}(\bar{x}^{i}_{k})\). It follows from the definitions of \(\mathbb{U}_{k}\) and \(\hat{\nabla}\bar{h}_{k}\) that
where (a) holds by inequality (13); (b) follows from Fact 3 and the fact that \(\beta _{k}\leqslant 1\). Similarly, it follows from (13), Fact 3, and the smoothness of \(h_{i}\) that
where we use Lemma 1 and the fact \(\rho _{k}\leqslant \rho _{k-1}\), and drop the factor \((1-\beta _{k})^{2}\) in the last inequality. Substituting (38) and (39) into (37), we obtain
Substituting the above result into (36) yields (15).
2) Taking \(\beta _{k}=\frac{2}{k+1}\), \(\gamma _{k}=\frac{2}{k+2}\) and \(\rho _{k}\leqslant \frac{d+2C_{1}}{\sqrt{n}(k+2)}\), we rewrite (15) as
Using Lemma 5 with \(a_{1}=1\), \(t_{0}=1\), \(a_{2}=2\), \(A_{1}=2\) and \(A_{2}=156L^{2}(d+2C_{1})^{2}+24\delta ^{2}\), we obtain \(\mathbb{E}[\|\bar{g}_{k}-\hat{\nabla}\bar{h}_{k}\|^{2}]\leqslant \frac{C_{3}}{k+2} \), where \(C_{3}:=\max \{2\|\bar{g}_{1}-\hat{\nabla}h(x_{1})\|^{2},156L^{2}(d+2C_{1})^{2}+24 \delta ^{2}\}\). Focusing on \(\mathbb{E}[\|\bar{g}_{k}-\bar{p}_{k}\|^{2}]\) and introducing \(\frac{1}{N}\sum_{i=1}^{N}\hat{\nabla}H_{i}(\bar{x}^{i}_{k})\), according to the definition of \(\bar{p}_{k}\) and the relation \(\hat{\nabla}\bar{h}_{k}=\frac{1}{N}\sum_{i=1}^{N}\hat{\nabla}H_{i}( \bar{x}^{i}_{k})\), we have
where we use (13) in the first inequality, and the last inequality holds due to (12). □
1.6 Proof of Lemma 4
Proof
Focusing on the LHS of (17), adding and subtracting the term \((\bar{p}_{k}+\bar{g}_{k})\) into \(\|\nabla h(\bar{x}_{k})-s^{i}_{k}\|^{2}\), we have
where the last inequality follows from (13). The first term of the RHS of (40) is rewritten as
where we use the smoothness of \(h_{i}\) and Lemma 1. Substituting (41), (16) and (14) into (40) and taking \(\beta _{k}=\frac{2}{k+1}\), \(\gamma _{k}=\frac{2}{k+2}\), \(\rho _{k}\leqslant \frac{d+2C_{1}}{\sqrt{n}(k+2)}\), we have the result. □
1.7 Proof of Theorem 1
Proof
It follows from the update (7) in Algorithm 1 and Assumption 3 that
where we use Assumption 4 in the last inequality. Focusing on the second term of the RHS of (42) and using the definition of \(\bar{z}_{k}\), we have
where we use the optimality of \(z^{i}_{k}\) in the last inequality. Adding and subtracting the term \(\frac{1}{N}\sum_{i=1}^{N}[\langle \nabla h(\bar{x}_{k})-s^{i}_{k},x^{*} \rangle ]\) into the RHS of the above inequality, we arrive at
where (a) holds by the convexity of the function \(h(x)\); (b) follows from Assumption 4. Substituting (43) into (42), rearranging, and subtracting \(h(x^{*})\) from both sides of the result, we arrive at
Taking the expectation on both sides of (44), and then applying Jensen’s inequality to the last term of the RHS of (44), we have
It follows from Lemma 4, \(\gamma _{k}=\frac{2}{k+2}\), and (45) that
Using Lemma 5 with
we prove the result. □
1.8 Proof of Theorem 2
Proof
Define \(v_{k}\in \operatorname{argmin}_{v\in \mathcal{X}}\langle \nabla h(\bar{x}_{k}),v \rangle \). We have \(p_{k}=\langle \nabla h(\bar{x}_{k}),\bar{x}_{k}-v_{k}\rangle \) by (18) and the definition of \(v_{k}\). We also obtain from the smoothness property of \(h(\cdot )\) that
where (a) holds by Lemma 6(b) and introducing \(s^{i}_{k}\); (b) is due to the optimality of \(z^{i}_{k}\) in the update (6). It follows from the definition of \(p_{k}\) and Assumption 4 that
Taking the full expectation on both sides of (47) and using Jensen’s inequality, we obtain
where we use (17) and substitute \(\gamma _{k}=\frac{2}{k+2}\) in the last inequality. Summing both sides of (48) from \(k=1\) to \(k=K\) and rearranging, we have
Define m such that \(2^{m}=K\), i.e., \(m=\log _{2}(K)\). According to the properties of p-series, we have \(m-1\leqslant \sum_{k=1}^{2^{m}}\frac{2}{k+2}\), \(\sum_{k=1}^{2^{m}}\frac{2}{(k+2)^{2}}\leqslant 4\), and there is a constant c such that \(\sum_{k=1}^{2^{m}}\frac{4d}{(k+2)^{\frac{3}{2}}}\leqslant c\). Hence, we rewrite (49) as
By rearranging and substituting \(m=\log _{2}(K)\), we obtain the result. □
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Hou, J., Zeng, X. & Chen, C. Distributed gradient-free and projection-free algorithm for stochastic constrained optimization. Auton. Intell. Syst. 4, 6 (2024). https://doi.org/10.1007/s43684-024-00062-0