To overcome these challenges, we propose a constructive approach based on hierarchical reinforcement learning (H-TSP for short) that obtains results comparable to SOTA approaches while using far less time, by up to two orders of magnitude. Starting from a partial route (initially containing only the depot), our H-TSP approach decomposes solution construction into two steps: first, we choose a relatively small subset of the remaining cities to be visited; second, we solve a small open-loop TSP instance that contains only the chosen cities. Once a tour for the chosen cities is obtained, it is merged into the existing partial route. This procedure continues until all cities have been visited and a feasible solution is obtained. Correspondingly, we devise two policies for the two steps: one chooses the candidate cities to be traversed, while the other decides the order in which these cities are visited. The two policies are trained jointly with reinforcement learning algorithms, so that we obtain an end-to-end algorithm for solving TSP. Because H-TSP solves TSP in a divide-and-conquer manner, it scales easily to large instances. The contributions are summarized as follows:
• We propose an effective hierarchical framework, named H-TSP, mainly for addressing the large-scale TSP. Using a divide-and-conquer strategy, H-TSP is, to the best of our knowledge, the first end-to-end approach that can scale to TSP instances with up to 10,000 nodes;
• We have conducted extensive experiments to show the effectiveness of H-TSP. The results demonstrate that, on instances with up to 10,000 nodes, H-TSP achieves results comparable to SOTA search-based approaches while spending less than 4 seconds per instance, reducing computation time by up to two orders of magnitude. Our approach is therefore particularly useful for time-sensitive applications;
• Decomposition is a common technique for solving large-scale combinatorial optimization problems (Mariescu-Istodor and Fränti 2021). We believe that the framework proposed in this paper has the potential to be extended to other problems; solving other large-scale optimization problems within this divide-and-conquer framework is left as future work.

Related Work
Due to the importance of TSP in research and industry, there have been many studies on this topic. Instead of giving a comprehensive overview, we mainly focus on the most related studies, and refer interested readers to Taillard and Helsgaun (2019) for an overview of heuristic approaches and to Guo et al. (2019) and Bengio, Lodi, and Prouvost (2021) for overviews of other machine learning approaches.
Traditional TSP algorithms can be roughly classified into two categories: exact algorithms and heuristic algorithms. Concorde (Applegate et al. 2007) is one of the best exact solvers. It models TSP as a mixed-integer program and then uses a branch-and-cut algorithm (Padberg and Rinaldi 1991) to solve it. LKH-3 (Helsgaun 2017) is a SOTA heuristic for solving TSP. It adopts the idea of local search and uses k-opt operators together with an α-nearness measure to reduce the search space. However, to obtain high-quality solutions, LKH-3 often takes hours or longer to terminate when solving TSP instances with tens of thousands of nodes.
Depending on how solutions are constructed, learning-based algorithms can be categorized into constructive-based methods and search-based methods.
Constructive-Based Methods The pointer network (Vinyals, Fortunato, and Jaitly 2015) is known as the first end-to-end method that incrementally generates TSP solutions from scratch. The backbone model is a Recurrent Neural Network, which is trained in a supervised manner. In contrast, Bello et al. (2017) improved the performance of the pointer network by training it with reinforcement learning, which yields policies with better generalization. Inspired by the success of the Transformer (Vaswani et al. 2017) in many fields, this architecture has also been extended to deal with TSP (Kool, Van Hoof, and Welling 2019; Kwon et al. 2020). Additionally, (Kool, Van Hoof, and Welling 2019) introduces a simple baseline based on a deterministic greedy roll-out to train the model using REINFORCE (Williams 1992). The work in (Kwon et al. 2020) further exploits the symmetries of TSP solutions, from which diverse roll-outs can be derived, so that a more efficient baseline than that of (Kool, Van Hoof, and Welling 2019) can be obtained. However, most of these works focus on solving TSP instances with no more than 100 cities, except that (Dai et al. 2017) considers instances with up to 1,200 cities.
Search-Based Methods In (da Costa et al. 2020), a method called DRL-2opt is proposed that uses DRL to train a policy which selects suitable 2-opt operators to continuously improve the current solution. Another approach, VSR-LKH (Zheng et al. 2021), can be seen as a variant of the LKH solver; it replaces the α-nearness measure in LKH with a Q-value table obtained by running reinforcement learning algorithms. Similar to our approach, Att-GCN+MCTS (Fu, Qiu, and Zha 2021) utilizes a decomposition mechanism to solve large-scale TSP. It consists of three stages: first, a supervised model is trained that can generate heat maps for small TSP instances; second, techniques such as graph sampling, graph converting, and heat map merging are introduced to generate heat maps of large-scale TSP from smaller ones; finally, a Monte Carlo tree search procedure searches for a good solution guided by the heat map. Search-based methods can often obtain high-quality solutions if given enough time, but this reliance on long running times limits their applications in time-sensitive scenarios.
It is worth mentioning another related work (Ma et al. 2021) that relies on Hierarchical Reinforcement Learning (HRL) to solve large-scale Dynamic Pickup and Delivery Problems (DPDPs) in practice. DPDPs can be seen as a variant of TSP where nodes are unknown a priori and are released periodically; the objective is to assign these nodes to proper routes in nearly real time while minimizing the total traverse cost. To address this problem, a hierarchical framework is introduced in which the upper-level policy dynamically partitions the problem into sub-problems, while the lower-level policy tries to solve each sub-problem efficiently.
The main difference between this method and H-TSP is that their method partitions a dynamic problem into multiple static sub-problems at the temporal level, while H-TSP decomposes a large-scale TSP into a set of small sub-problems at the spatial level.

Problem Definition
In this paper, we focus on the two-dimensional Euclidean TSP. Let G = (V, E) denote an undirected graph, where V = {v_i | 1 ≤ i ≤ N} represents the set of nodes, E = {e_{i,j} | 1 ≤ i, j ≤ N} is the set of edges, and N denotes the number of nodes. For every edge e_{i,j}, define cost(i, j) as its traverse cost, namely the distance between i and j. We define a special node v_d ∈ V representing the depot, where the salesman starts and ends. A feasible solution of a TSP instance is defined as a Hamiltonian cycle that visits all the nodes in V exactly once. Our goal is to minimize the total cost of the solution route τ, written as L(τ) and shown in Eq. (1):

L(\tau) = \sum_{i=1}^{N-1} \mathrm{cost}(\tau_i, \tau_{i+1}) + \mathrm{cost}(\tau_N, \tau_1)   (1)

where τ_i is the i-th node in the route. Without loss of generality, we assume all coordinates lie in [0, 1].

Sub-Problem Definition
Our hierarchical method decomposes TSP into small sub-problems at the spatial level. To facilitate the merging of sub-solutions, we define our sub-problem as a variant of TSP, the open-loop TSP with fixed endpoints (Papadimitriou 1977). Given an undirected graph G = (V, E), the open-loop TSP has two special nodes v_s and v_t in V representing the source node and the target node, respectively. A feasible solution of an open-loop TSP is no longer a cycle but a path, which starts from v_s, visits all other nodes exactly once, and ends at v_t. It is easy to see that the solutions of two open-loop TSPs can be merged into a closed-loop path. The two endpoints of a sub-problem need to be fixed and specified in advance; otherwise, the sub-solutions would have arbitrary endpoints, which leads to a poor combined solution.

The Hierarchical Framework
This paper proposes a Deep Reinforcement Learning (DRL) based hierarchical framework, denoted H-TSP, to solve the large-scale TSP. Following the divide-and-conquer approach, H-TSP contains policies/models on two levels, which are responsible for generating sub-problems and for solving sub-problems, respectively.
The entire procedure of H-TSP is summarized in Algorithm 1. It starts with an initial solution containing the depot and then inserts the node nearest to the depot; these two nodes serve as the fixed endpoints of the first sub-problem. The upper-level policy is responsible for decomposing the original problem and merging the sub-solutions returned by the lower-level policy. As decomposition will inevitably degrade the quality of the final solution, we let the upper-level policy learn to generate a decomposition strategy in an adaptive and dynamic manner to alleviate this. On the other hand, once a sub-problem is identified, it is handed over to the lower-level policy, which solves it as an open-loop TSP. Its solution is then passed back to the upper-level policy to be merged into the existing partial route.

Algorithm 1: Hierarchical TSP Algorithm
Input: TSP instance V = {v_1, v_2, ..., v_N}, initial solution τ_init = {v_d}
Output: Solution route τ = {τ_1, τ_2, ..., τ_N}
1: τ ← τ_init ∪ {v}, where v is the node nearest to v_d;
2: while len(τ) < N do
3:   SubProb ← GenerateSubProb(V, τ);
4:   SubSol ← SolveSubProb(SubProb);
5:   τ ← MergeSubSol(SubSol, τ);
6: end
7: return τ

Upper-Level Model
As mentioned before, the upper-level model decomposes the large-scale TSP so that the resulting sub-problems can be solved efficiently without significantly degrading the quality of the final solution. To achieve this, we conduct decomposition in an adaptive and dynamic manner. This differs from the existing decomposition-based approach of (Fu, Qiu, and Zha 2021), where all sub-problems are pre-generated before being merged into the final solution. By interleaving decomposition and merging, the upper-level model can learn an adaptive policy that makes the best decision based on the current partial solution and the distribution of the remaining nodes.

A Scalable Encoder One of the key obstacles in solving the large-scale TSP with DRL is encoding a large number of edges in the graph. To obtain a scalable encoder, inspired by a technique used in 3D point cloud projection (Lang et al. 2019), we propose a Pixel Encoder that encodes the graph as pixels. The idea is to convert a point cloud, which in our case consists of the nodes of a TSP instance, into a pseudo-image.
As a first step, the 2D space is discretized into an evenly spaced H × W grid, creating a set of pixels. The nodes are then divided into clusters based on the grid cell they fall into. We augment the features of each node to a vector (x_a, y_a, Δx_g, Δy_g, Δx_c, Δy_c, x_pre, y_pre, x_nxt, y_nxt, m_select), where (x_a, y_a) is the absolute coordinate of the node, and (Δx_g, Δy_g) and (Δx_c, Δy_c) are its relative coordinates to the grid-cell center and the node-cluster center, respectively. If the node has been visited, (x_pre, y_pre, x_nxt, y_nxt) denote the coordinates of its neighbors on the partial route; otherwise they are 0. The boolean variable m_select indicates whether the node has been visited or not.
For a TSP instance with N nodes, we thus have a tensor of size (N, D), where D = 11 is the number of features. This tensor is processed by a linear layer to generate an (N, C)-sized high-dimensional tensor. According to the divided clusters, we apply a max operation to the C-dimensional features of the nodes in each cluster to obtain the feature of each grid cell, and we use zero padding for empty cells. The combination of all grid cells forms a pseudo-image: a tensor of size (H, W, C).
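As an illustration, the per-cell max pooling described above can be sketched as follows. This is only a minimal sketch under our own naming (pixel_encode and grid_idx are not taken from the authors' implementation), assuming the (N, C) node features and each node's flattened grid-cell index have already been computed:

```python
import torch

def pixel_encode(node_feats: torch.Tensor, grid_idx: torch.Tensor, H: int, W: int) -> torch.Tensor:
    """Max-pool per-node features into an (H, W, C) pseudo-image.

    node_feats: (N, C) node features after the per-node linear layer.
    grid_idx:   (N,)   flattened grid-cell index (row * W + col) of each node.
    Cells that contain no node are left as zeros (zero padding).
    """
    N, C = node_feats.shape
    image = torch.zeros(H * W, C)
    for cell in grid_idx.unique():
        mask = grid_idx == cell
        # feature of this grid cell: element-wise max over the nodes it contains
        image[cell] = node_feats[mask].max(dim=0).values
    return image.reshape(H, W, C)
```

In practice the per-cell maximum would be vectorized (e.g., with a scatter-max), but the explicit loop keeps the construction of the pseudo-image easy to follow.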
This pseudo-image can then be processed by a convolutional neural network (CNN), resulting in an embedding vector of the whole TSP instance for the DRL model.
The DRL model follows the actor-critic architecture: there is a policy head for the policy function and a value head for the state-value function. Both heads are composed of fully connected layers and activation functions.

Sub-Problem Generation and Merging The action space of our upper-level policy is continuous with two dimensions, which denote the coordinates of a point in the grid. We illustrate in this paragraph how a sub-problem is generated and merged given an upper-level policy; the procedure is depicted in Algorithm 2. For a given action Coord_pred, let v_c be the node closest to Coord_pred that has not been visited yet. We further let v_b be the node closest to v_c that has already been visited. Then we keep expanding the sub-problem by selecting unvisited nodes from the neighbors of v_b based on a k-Nearest Neighbor (k-NN) graph, until the size of the sub-problem reaches maxNum or all nodes have been visited or selected. Intuitively, the k-NN graph is implemented by associating each node with its k closest nodes, and the sub-problem expansion follows a breadth-first search on this graph. Finally, we enrich the sub-problem (SelectFragment) with a fragment of visited nodes centered at v_b, so that the resulting sub-problem contains no more than subLength nodes. In this way, we break the existing partial route to obtain a path with two endpoints, and after solving the sub-problem as an open-loop TSP, we obtain another path with two endpoints.

Algorithm 2: Sub-problem Generation
Input: k-NN graph G_kNN = (V, E); partial solution at step t, τ_t = (v_{t1}, v_{t2}, ...); length of the sub-problem subLength; maximum number of unvisited nodes maxNum; upper-level model UpperModel
Output: Sub-problem P = (v_{s1}, v_{s2}, ..., v_{s_subLength}); two endpoints v_s, v_t ∈ P
1: P ← ∅, S_v ← τ_t, Q_new ← Deque();
2: UpperModel takes G_kNN and τ_t as input and outputs Coord_pred;
3: v_c ← the unvisited node closest to Coord_pred;
4: v_b ← the visited node closest to v_c;
5: push v_b to the end of Q_new;
6: while len(P) ≤ maxNum and Q_new is not empty do
7:   v_i ← PopFront(Q_new);
8:   for v_j ∈ N_{G_kNN}(v_i) with v_j ∉ S_v do
9:     push v_j to the end of Q_new;
10:    add v_j to S_v and P;
11:  end
12: end
13: oldLength ← subLength − len(P);
14: P_t ← SelectFragment(τ_t, v_b, oldLength);
15: P ← P ∪ P_t;
16: v_s, v_t ← SetEndpoints(P_t);
17: return P, v_s, v_t

Markov Decision Process We can now formally introduce the underlying MDP of the upper-level policy. Let M_G = ⟨S, A, P, R, γ⟩ denote an MDP modelling a given TSP instance G = (V, E), where
• S is the set of all states, containing all possible path fragments τ of G;
• A = [0, 1] × [0, 1] is the set of all actions, containing all points in the unit grid;
• P : S × A → S is a deterministic transition function given both the upper-level and lower-level policies;
• R : S × A × S → ℝ is the reward function defined by R(τ, a, τ′) = L(τ′) − L(τ), where τ is the current partial route and τ′ is the previous partial route;
• γ is the discount factor, which is set to 1 in our experiments.

Lower-Level Model
The lower-level model is trained to solve the open-loop TSPs with fixed endpoints generated by the upper-level model. As lower-level policies are invoked many times during training and inference, their performance has a significant impact on the performance of our approach. Fortunately, there are many end-to-end approaches that can solve relatively small-scale TSPs effectively and efficiently (Kool, Van Hoof, and Welling 2019; Kwon et al. 2020). We adopt the main ideas of these approaches to devise an efficient lower-level policy, which we briefly describe in this section.

Neural Network The underlying neural network of our lower-level model is a Transformer, which has been widely used in natural language processing and computer vision in recent years. It consists of Multi-Head Attention and Multi-Layer Perceptron layers, with a mask mechanism to remove all invalid actions.
Our neural network follows the encoder-decoder structure, where the encoder uses self-attention layers to encode the input node sequence, while the decoder outputs a sequence of nodes in an auto-regressive manner. In the approaches presented in (Kool, Van Hoof, and Welling 2019; Kwon et al. 2020), the following context is used as input of the encoder:

q_{context} = q_{graph} + q_{first} + q_{last}   (2)

where q_graph, q_first, and q_last represent the feature vectors of the whole graph, the first node, and the last node of the current partial solution, respectively. While sufficient for TSP, this is inadequate for open-loop TSP, where we must keep in mind that there are two fixed endpoints. Therefore, we add two more vectors to encode the features of the two endpoints; that is, the input of our encoder is the context vector defined as follows:

q_{context} = q_{graph} + q_{first} + q_{last} + q_{source} + q_{target}   (3)
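To make Eq. (3) concrete, a minimal sketch is given below. The function name is ours, and the mean-pooled graph feature is an assumption for illustration (the text does not specify how q_graph is computed); the indices refer to precomputed node embeddings:

```python
import torch

def open_loop_context(node_emb: torch.Tensor, first: int, last: int,
                      source: int, target: int) -> torch.Tensor:
    """Context vector of Eq. (3) for an open-loop sub-problem.

    node_emb: (n, d) embeddings of the sub-problem nodes.
    first/last:    indices of the first/last node of the current partial solution.
    source/target: indices of the two fixed endpoints.
    """
    q_graph = node_emb.mean(dim=0)  # assumed: whole-graph feature via mean pooling
    return (q_graph + node_emb[first] + node_emb[last]
            + node_emb[source] + node_emb[target])
```

Compared with the closed-loop context of Eq. (2), only the two endpoint terms are new.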
The POMO approach introduced in (Kwon et al. 2020) takes advantage of the symmetry property of TSP, which improves its performance considerably. Although open-loop TSPs do not have the same symmetry property because of the fixed endpoints, we can achieve such a symmetry easily as follows: during node selection, all nodes except the endpoints are treated as in an ordinary TSP, without any constraint; whenever one endpoint is chosen, we let the other one be chosen automatically. The final solution of the original open-loop TSP is obtained by removing the redundant edge between the two endpoints.

Markov Decision Process The underlying MDP of the lower-level policy can be defined similarly to (Kool, Van Hoof, and Welling 2019; Kwon et al. 2020), where
• States: the states contain all possible contexts defined as in Eq. (3);
• Actions: the actions contain all nodes of the TSP, with dynamic masks to remove nodes that have already been visited;
• Rewards: a reward equal to the negative cost of the route is assigned whenever a state corresponding to a feasible solution is encountered; otherwise the reward is 0.

Training
The proposed framework is trained by a hierarchical DRL algorithm. More specifically, the models on the two levels are trained jointly with DRL.

Upper-Level Model
The upper-level model is trained with the well-known Proximal Policy Optimization (PPO) algorithm (Schulman et al. 2017), one of the SOTA DRL algorithms based on the actor-critic architecture. It learns a stochastic policy by minimizing the following clipped objective function:

L(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right)\hat{A}_t\right)\right]   (4)

where r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{old}}(a_t \mid s_t) denotes the probability ratio of the two policies, \hat{A}_t denotes the advantage function, and \epsilon is a hyperparameter controlling the clipping range. The advantage function represents the advantage of the current policy over the old policy; here we use the Generalized Advantage Estimator (GAE) to compute it:

\hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l \left( r_{t+l} + \gamma \hat{V}(s_{t+l+1}) - \hat{V}(s_{t+l}) \right)   (5)

where r_t is the reward at time t, \hat{V} is the state-value function, \gamma is the discount factor, and \lambda is the hyper-parameter that controls the compromise between bias and variance of the estimated advantage.
Besides the policy loss, we also add a value loss and an entropy loss:

L_{\hat{V}}(\theta) = \hat{\mathbb{E}}_t\left[ \left( \hat{V} - \hat{V}_\theta \right)^2 \right]   (6)

L_E(\theta) = \hat{\mathbb{E}}_t\left[ \pi_\theta(a \mid s) \log \pi_\theta(a \mid s) \right]   (7)

The total loss of the upper-level model is:

L_{UPPER}(\theta) = \lambda_p L(\theta) + \lambda_v L_{\hat{V}}(\theta) - \lambda_e L_E(\theta)   (8)

where \lambda_v is the weight of the value loss and \lambda_e is the weight of the entropy loss, which balances the policy's exploration and exploitation.

Lower-Level Model
The lower-level model is an end-to-end model for solving open-loop TSPs with a relatively small number of nodes. It is trained by the classic REINFORCE (Williams 1992) algorithm with a shared baseline, as in (Kool, Van Hoof, and Welling 2019; Kwon et al. 2020). The REINFORCE algorithm collects experience by Monte Carlo sampling, and the policy gradient is computed as follows:

\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[ \nabla_\theta \log \pi_\theta(\tau \mid s)\, A^{\pi_\theta}(\tau) \right] \approx \frac{1}{N} \sum_{i=1}^{N} \left( R(\tau^i) - b(s) \right) \nabla_\theta \log \pi_\theta(\tau^i \mid s)   (9)

where \tau denotes a trajectory, namely a feasible solution of a TSP instance. The reward R(\tau^i) = -L(\tau^i) is defined as the negative cost of \tau^i. The shared baseline b(s) is used to reduce the variance and improve training stability; it is obtained by averaging the returns of a set of trajectories generated from the same instance:

b(s) = \frac{1}{N} \sum_{i=1}^{N} R(\tau^i)   (10)

Joint Training
In order to improve the performance of the upper-level and lower-level models, we adopt a joint training strategy. Specifically, the current lower-level policy is used to collect trajectories for training the upper-level model, and meanwhile the sub-problems generated by the upper-level policy are in turn stored to train the lower-level model. Through this interleaved training procedure, the policies on the two levels receive instant feedback from each other, which makes learning a cooperative policy possible.
As mentioned before, the solution quality of the lower-level policy has a significant impact on the final solution. If we start from a random lower-level policy, the upper-level policy would receive a lot of misleading feedback, making its training hard to converge. To alleviate this, we introduce a warm-up stage for the lower-level model, pre-training it with sub-problems randomly generated from the original TSP. According to our experiments, such a warm-up stage accelerates convergence and makes training more stable.
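The interleaved procedure described above can be summarized by the following schematic loop. This is only a sketch: rollout, ppo_update, and reinforce_update are placeholder callables standing for the H-TSP rollout and the updates of Eqs. (4)-(8) and (9)-(10), not functions from the authors' implementation, and the lower-level policy is assumed to be warmed up already.

```python
def joint_training(upper, lower, instances, epochs,
                   rollout, ppo_update, reinforce_update):
    """Schematic joint training of the upper- and lower-level policies.

    rollout(upper, lower, tsp) -> (trajectory, subproblems): runs H-TSP on one
        instance with the *current* lower-level policy, recording both the
        upper-level trajectory and the generated open-loop sub-problems.
    ppo_update(upper, trajectory): PPO update of the upper-level model.
    reinforce_update(lower, subproblems): REINFORCE update with shared baseline.
    """
    for _ in range(epochs):
        subproblem_buffer = []
        for tsp in instances:
            trajectory, subproblems = rollout(upper, lower, tsp)
            subproblem_buffer.extend(subproblems)   # stored to train the lower level
            ppo_update(upper, trajectory)           # upper level gets instant feedback
        reinforce_update(lower, subproblem_buffer)  # lower level trained on fresh sub-problems
    return upper, lower
```

The key point is that each level is updated on data produced while the other level's current policy is in the loop, which is what allows the two policies to adapt to each other.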
Experiments
To demonstrate how our approach works on the large-scale TSP, we adopt four datasets to evaluate it. The four datasets contain TSP instances with problem sizes of 1000, 2000, 5000, and 10000 nodes, denoted as Random1000, Random2000, Random5000, and Random10000, respectively. To make the experiment results comparable, Random1000 and Random10000 contain the same instances used by Fu et al. in their work (Fu, Qiu, and Zha 2021), while the instances in Random2000 and Random5000 are generated with nodes uniformly distributed in the unit square, in line with existing approaches. Each dataset contains 16 TSP instances, except Random1000, which contains 128 instances. All experiment results were obtained on a machine with an NVIDIA Tesla V100 (16GB) GPU and an Intel Xeon Platinum CPU.

Hyper-Parameters Setting The upper-level model consists of a pixel encoder and a DRL agent model. We use a 3-layer CNN for the pixel encoder with 16, 32, and 32 channels, respectively, and it outputs a 128-dimensional feature vector. Our DRL model with the actor-critic architecture consists of an actor network and a critic network, each of which is a 4-layer MLP. The lower-level model follows the encoder-decoder structure, with a 12-layer self-attention encoder and a 2-layer context-attention decoder. Most embedding dimensions in the neural networks are set to 128, except for the CNN layers, the first encoding layer, and the output layer. During training, we use the AdamW optimizer with a learning rate of 1e-4 and a weight decay of 1e-6. For the sub-problem generation stage, we set k = 40 for the k-nearest-neighbor graph, the sub-problem length to 200, and the maximum number of new nodes in a sub-problem to 190. The lower-level model is trained for 500 epochs in the warm-up stage, and the joint training stage takes 500, 1000, 1500, and 2000 epochs, respectively, for the different datasets. Our algorithm is implemented based on PyTorch (Paszke et al. 2019); the trained models and related data are publicly available.
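For reference, these settings can be collected into a single configuration object. The key names below are purely illustrative (they are not taken from the released artifacts); the values are the ones quoted in the text:

```python
# Hyper-parameters of H-TSP as reported above; key names are illustrative.
H_TSP_CONFIG = {
    "pixel_encoder_cnn_channels": (16, 32, 32),        # 3-layer CNN
    "instance_embedding_dim": 128,
    "actor_mlp_layers": 4,
    "critic_mlp_layers": 4,
    "lower_encoder_layers": 12,                        # self-attention encoder
    "lower_decoder_layers": 2,                         # context-attention decoder
    "embedding_dim": 128,                              # most layers
    "optimizer": "AdamW",
    "learning_rate": 1e-4,
    "weight_decay": 1e-6,
    "knn_k": 40,
    "sub_problem_length": 200,                         # subLength
    "max_new_nodes": 190,                              # maxNum
    "warmup_epochs": 500,
    "joint_training_epochs": (500, 1000, 1500, 2000),  # per dataset size
}
```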
Results on Random1000 and Random2000 (tour Length, optimality Gap in %, and Time in s):

Algorithm       Random1000                  Random2000
                Length   Gap(%)   Time(s)   Length   Gap(%)   Time(s)
Concorde        23.12    0.00     487.89    32.48    0.00     7949.97
LKH-3           23.16    0.17     22.01     32.64    0.49     79.75
OR-Tools        24.23    4.82     104.34    34.04    4.82     532.14
POMO            30.52    32.01    4.28      46.49    43.15    35.89
DRL-2opt        37.90    63.93    55.56     115.59   255.92   827.43
Att-GCN+MCTS    23.86    3.22     5.85      33.42    2.91     200.28
H-TSP           24.65    6.62     0.33      34.88    7.39     0.72

Results on Random5000 and Random10000 (tour Length, optimality Gap in %, and Time in s):

Algorithm       Random5000                  Random10000
                Length   Gap(%)   Time(s)   Length    Gap(%)    Time(s)
LKH-3           51.36    0.00     561.74    72.45     0.00      4746.59
OR-Tools        53.35    3.86     5368.24   74.95     3.44      21358.66
POMO            80.79    57.29    575.63    OOM       OOM       OOM
DRL-2opt        754.91   1369.76  2308.48   2860.86   3848.66   6073.43
Att-GCN+MCTS    52.83    2.86     377.47    74.93     3.42      395.85
H-TSP           55.01    7.10     1.66      77.75     7.32      3.32

Figure 2: Optimality gap of models with different upper-level and lower-level models (DRL vs. random upper level, DRL vs. heuristic lower level) on TSP1000-TSP10000.

To evaluate the generalization ability of H-TSP, we test the four trained models on randomly generated datasets with different numbers of nodes. Figure 1 shows that H-TSP generalizes well to TSP instances ranging from 1000 to 50000 nodes. Note that the optimality gap on Random50000 is smaller than the gap on Random20000, because the reference solutions of these datasets are generated by LKH-3, and the solution quality of LKH-3 also declines as the number of nodes increases.

Algorithm          Random5000            Random10000
                   Gap(%)    Time(s)     Gap(%)    Time(s)
H-TSP              7.10      1.66        7.32      3.32
H-TSP with LKH-3   5.10      15.12       5.57      27.94
Att-GCN+MCTS       2.86      377.47      3.42      395.85

Table 2: Comparison of H-TSP and its variant with LKH-3

Variant                 Gap (%)   Δ Gap (%)
H-TSP                   6.76      0.00
w/o visited fragment    7.42      +0.66
w/o k-NN                7.69      +0.93
w/o joint training      7.53      +0.77
w/o warm-up             27.05     +20.29
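In all tables, Gap denotes the optimality gap of a method relative to the best solution in the corresponding column (Concorde for Random1000/2000 and LKH-3 for Random5000/10000, both of which show a gap of 0.00); assuming the standard definition, it is computed as

\mathrm{Gap} = \frac{L_{\text{method}} - L_{\text{best}}}{L_{\text{best}}} \times 100\%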
References
Applegate, D. L.; Bixby, R. E.; Chvátal, V.; and Cook, W. J. 2007. The Traveling Salesman Problem: A Computational Study. Princeton University Press.
Applegate, D. L.; Bixby, R. E.; Chvátal, V.; Cook, W.; Espinoza, D. G.; Goycoolea, M.; and Helsgaun, K. 2009. Certification of an optimal TSP tour through 85,900 cities. Operations Research Letters, 37(1): 11–15.
Bello, I.; Pham, H.; Le, Q. V.; Norouzi, M.; and Bengio, S. 2017. Neural combinatorial optimization with reinforcement learning. In 5th International Conference on Learning Representations, ICLR 2017 – Workshop Track Proceedings.
Bengio, Y.; Lodi, A.; and Prouvost, A. 2021. Machine learning for combinatorial optimization: A methodological tour d'horizon. European Journal of Operational Research, 290(2): 405–421.
da Costa, P. R. d. O.; Rhuggenaath, J.; Zhang, Y.; and Akcay, A. 2020. Learning 2-Opt Heuristics for the Traveling Salesman Problem via Deep Reinforcement Learning. In Pan, S. J.; and Sugiyama, M., eds., Proceedings of the 12th Asian Conference on Machine Learning, volume 129 of Proceedings of Machine Learning Research, 465–480. PMLR.
Dai, H.; Khalil, E. B.; Zhang, Y.; Dilkina, B.; and Song, L. 2017. Learning combinatorial optimization algorithms over graphs. In Advances in Neural Information Processing Systems, 6349–6359.
Fu, Z.-H.; Qiu, K.-B.; and Zha, H. 2021. Generalize a Small Pre-trained Model to Arbitrarily Large TSP Instances. Proceedings of the AAAI Conference on Artificial Intelligence, 35(8): 7474–7482.
Ghiani, G.; Guerriero, F.; Laporte, G.; and Musmanno, R. 2003. Real-time vehicle routing: Solution concepts, algorithms and parallel computing strategies. European Journal of Operational Research, 151(1): 1–11.
Guo, T.; Han, C.; Tang, S.; and Ding, M. 2019. Solving Combinatorial Problems with Machine Learning Methods. Springer Optimization and Its Applications, 147: 207–229.
Helsgaun, K. 2017. An Extension of the Lin-Kernighan-Helsgaun TSP Solver for Constrained Traveling Salesman and Vehicle Routing Problems. Roskilde: Roskilde University.
Kool, W.; Van Hoof, H.; and Welling, M. 2019. Attention, learn to solve routing problems! In 7th International Conference on Learning Representations, ICLR 2019.
Kwon, Y. D.; Choo, J.; Kim, B.; Yoon, I.; Gwon, Y.; and Min, S. 2020. POMO: Policy optimization with multiple optima for reinforcement learning. In Advances in Neural Information Processing Systems, volume 33.
Lang, A. H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; and Beijbom, O. 2019. PointPillars: Fast Encoders for Object Detection From Point Clouds. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 12689–12697.
Ma, Y.; Hao, X.; Hao, J.; Lu, J.; Liu, X.; Xialiang, T.; Yuan, M.; Li, Z.; Tang, J.; and Meng, Z. 2021. A Hierarchical Reinforcement Learning Based Optimization Framework for Large-Scale Dynamic Pickup and Delivery Problems. In Ranzato, M.; Beygelzimer, A.; Dauphin, Y.; Liang, P.; and Vaughan, J. W., eds., Advances in Neural Information Processing Systems, volume 34, 23609–23620. Curran Associates, Inc.
Mariescu-Istodor, R.; and Fränti, P. 2021. Solving the Large-Scale TSP Problem in 1 h: Santa Claus Challenge 2020. Frontiers in Robotics and AI, 8: 689908.
Nowak, A.; Villar, S.; Bandeira, A. S.; and Bruna, J. 2017. A Note on Learning Algorithms for Quadratic Assignment with Graph Neural Networks. arXiv:1706.07450.
Padberg, M.; and Rinaldi, G. 1991. A Branch-and-Cut Algorithm for the Resolution of Large-Scale Symmetric Traveling Salesman Problems. SIAM Review, 33(1): 60–100.
Papadimitriou, C. H. 1977. The Euclidean travelling salesman problem is NP-complete. Theoretical Computer Science, 4: 237–244.
Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; Desmaison, A.; Kopf, A.; Yang, E.; DeVito, Z.; Raison, M.; Tejani, A.; Chilamkurthy, S.; Steiner, B.; Fang, L.; Bai, J.; and Chintala, S. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Wallach, H.; Larochelle, H.; Beygelzimer, A.; d'Alché-Buc, F.; Fox, E.; and Garnett, R., eds., Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
Rosenkrantz, D. J.; Stearns, R. E.; and Lewis, P. M. 1974. Approximate algorithms for the traveling salesperson problem. In 15th Annual Symposium on Switching and Automata Theory (SWAT 1974), 33–42. IEEE.
Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal policy optimization algorithms. arXiv:1707.06347.
Taillard, É. D.; and Helsgaun, K. 2019. POPMUSIC for the travelling salesman problem. European Journal of Operational Research, 272(2): 420–429.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5999–6009.
Vinyals, O.; Fortunato, M.; and Jaitly, N. 2015. Pointer Networks. In Cortes, C.; Lawrence, N.; Lee, D.; Sugiyama, M.; and Garnett, R., eds., Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc.
Williams, R. J. 1992. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8(3): 229–256.
Xu, Z.; Li, Z.; Guan, Q.; Zhang, D.; Li, Q.; Nan, J.; Liu, C.; Bian, W.; and Ye, J. 2018. Large-scale order dispatch in on-demand ride-hailing platforms: A learning and planning approach. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 905–913.
Zheng, J.; He, K.; Zhou, J.; Jin, Y.; and Li, C.-M. 2021. Combining Reinforcement Learning with Lin-Kernighan-Helsgaun Algorithm for the Traveling Salesman Problem. Proceedings of the AAAI Conference on Artificial Intelligence, 35(14): 12445–12452.