To overcome these challenges, we propose a constructive approach based on hierarchical reinforcement learning (H-TSP for short) that obtains results comparable to SOTA approaches while using far less time, by up to two orders of magnitude. Starting from a partial route (initially containing only the depot), our H-TSP approach decomposes solution construction into two steps: first, we choose a relatively small subset of the remaining cities to be visited; second, we solve a small open-loop TSP instance that contains only the chosen cities. Once a tour for the chosen cities is obtained, it is merged into the existing partial route. This procedure continues until all cities have been visited and a feasible solution is obtained. Correspondingly, we devise two policies for the two steps: one chooses the candidate cities to be traversed, while the other decides the order in which these cities are visited. The two policies are trained jointly with reinforcement learning algorithms, so that we obtain an end-to-end algorithm for solving TSP. Because H-TSP solves TSP in a divide-and-conquer manner, it scales easily to large instances. The contributions are summarized as follows:
• We propose an effective hierarchical framework, named H-TSP, mainly for addressing the large-scale TSP. Using a divide-and-conquer strategy, H-TSP is, to the best of our knowledge, the first end-to-end approach that can scale to TSP instances with up to 10,000 nodes;
• We have conducted extensive experiments to show the effectiveness of H-TSP. The results demonstrate that, on instances with up to 10,000 nodes, H-TSP achieves results comparable to SOTA search-based approaches while spending less than 4 seconds per instance, reducing computation time by up to two orders of magnitude. Our approach is therefore particularly useful for time-sensitive applications;
• Decomposition is a common technique for solving large-scale combinatorial optimization problems (Mariescu-Istodor and Fränti 2021). We believe that the framework proposed in this paper has the potential to be extended to other problems; solving other large-scale optimization problems within this divide-and-conquer framework is left as future work.

Related Work
Due to the importance of TSP in research and industry, there have been many studies on this topic. Instead of giving a comprehensive overview, we mainly focus on the most related studies, and refer interested readers to Taillard and Helsgaun (2019) for an overview of heuristic approaches and to Guo et al. (2019) and Bengio, Lodi, and Prouvost (2021) for overviews of other machine learning approaches.
Traditional TSP algorithms can be roughly classified into two categories: exact algorithms and heuristic algorithms. Concorde (Applegate et al. 2007) is one of the best exact solvers. It models TSP as a mixed-integer program and then uses a branch-and-cut algorithm (Padberg and Rinaldi 1991) to solve it. LKH-3 (Helsgaun 2017) is a SOTA heuristic for solving TSP. It adopts the idea of local search and uses k-opt operators together with an α-nearness measure to reduce the search space. However, to obtain high-quality solutions, LKH-3 often takes hours or longer to terminate when solving TSP instances with tens of thousands of nodes.
Depending on how solutions are constructed, learning-based algorithms can be categorized into constructive-based methods and search-based methods.
Constructive-Based Methods The pointer network (Vinyals, Fortunato, and Jaitly 2015) is known as the first end-to-end method that incrementally generates TSP solutions from scratch. The backbone model is a Recurrent Neural Network, which is trained in a supervised manner. In contrast, Bello et al. (2017) improved the performance of the pointer network by training it with reinforcement learning, which yields policies with better generalization. Inspired by the success of the Transformer (Vaswani et al. 2017) in many fields, this architecture has also been extended to deal with TSP (Kool, Van Hoof, and Welling 2019; Kwon et al. 2020). Additionally, (Kool, Van Hoof, and Welling 2019) introduces a simple baseline based on a deterministic greedy roll-out to train the model using REINFORCE (Williams 1992). The work in (Kwon et al. 2020) further exploits the symmetries of TSP solutions, from which diverse roll-outs can be derived, so that a more efficient baseline than that of (Kool, Van Hoof, and Welling 2019) can be obtained. However, most of these works focus on solving TSP instances with no more than 100 cities, except that (Dai et al. 2017) considers instances with up to 1,200 cities.
Search-Based Methods In (da Costa et al. 2020), a method called DRL-2opt is proposed that uses DRL to train a policy which selects suitable 2-opt operators to continuously improve the current solution. Another approach, VSR-LKH (Zheng et al. 2021), can be seen as a variant of the LKH solver; it replaces the α-nearness measure in LKH with a Q-value table obtained by running reinforcement learning algorithms. Similar to our approach, Att-GCN+MCTS (Fu, Qiu, and Zha 2021) utilizes a decomposition mechanism to solve large-scale TSP. It consists of three stages: first, a supervised model is trained that can generate heat maps for small TSP instances; second, techniques such as graph sampling, graph converting, and heat map merging are introduced to generate heat maps of large-scale TSP from smaller ones; finally, a Monte Carlo tree search procedure searches for a good solution guided by the heat map. Search-based methods can often obtain high-quality solutions if given enough time, but this reliance on long running times limits their applications in time-sensitive scenarios.
It is worth mentioning another related work (Ma et al. 2021) that relies on Hierarchical Reinforcement Learning (HRL) to solve large-scale Dynamic Pickup and Delivery Problems (DPDPs) in practice. DPDPs can be seen as a variant of TSP where nodes are unknown a priori and are released periodically; the objective is to assign these nodes to proper routes in nearly real time while minimizing the total traverse cost. To address this problem, a hierarchical framework is introduced in which the upper-level policy dynamically partitions the problem into sub-problems, while the lower-level policy tries to solve each sub-problem efficiently.
The main difference between this method and H-TSP is that their method partitions a dynamic problem into multiple static sub-problems at the temporal level, while H-TSP decomposes a large-scale TSP into a set of small sub-problems at the spatial level.

Problem Definition
In this paper, we focus on the two-dimensional Euclidean TSP. Let G = (V, E) denote an undirected graph, where V = {v_i | 1 ≤ i ≤ N} represents the set of nodes, E = {e_{i,j} | 1 ≤ i, j ≤ N} is the set of edges, and N denotes the number of nodes. For every edge e_{i,j}, define cost(i, j) as its traverse cost, namely the distance between i and j. We define a special node v_d ∈ V representing the depot, where the salesman starts and ends. A feasible solution of a TSP instance is defined as a Hamiltonian cycle that visits all the nodes in V exactly once. Our goal is to minimize the total cost of the solution route τ, written as L(τ) and shown in Eq. (1):

L(\tau) = \sum_{i=1}^{N-1} \mathrm{cost}(\tau_i, \tau_{i+1}) + \mathrm{cost}(\tau_N, \tau_1)   (1)

where τ_i is the i-th node in the route. Without loss of generality, we assume all coordinates lie in [0, 1].

Sub-Problem Definition
Our hierarchical method decomposes TSP into small sub-problems at the spatial level. To facilitate the merging of sub-solutions, we define our sub-problem as a variant of TSP, the open-loop TSP with fixed endpoints (Papadimitriou 1977). Given an undirected graph G = (V, E), the open-loop TSP has two special nodes v_s and v_t in V representing the source node and the target node, respectively. A feasible solution of an open-loop TSP is no longer a cycle but a path, which starts from v_s, visits all other nodes exactly once, and ends at v_t. It is easy to see that the solutions of two open-loop TSPs can be merged into a closed-loop path. The two endpoints of a sub-problem need to be fixed and specified in advance; otherwise, the sub-solutions would have arbitrary endpoints, which leads to a poor combined solution.

The Hierarchical Framework
This paper proposes a Deep Reinforcement Learning (DRL) based hierarchical framework, denoted H-TSP, to solve the large-scale TSP. Following the divide-and-conquer approach, H-TSP contains policies/models on two levels, which are responsible for generating sub-problems and for solving sub-problems, respectively.
The entire procedure of H-TSP is summarized in Algorithm 1. It starts with an initial solution containing the depot and then inserts the node nearest to the depot; these two nodes serve as the fixed endpoints of the first sub-problem. The upper-level policy is responsible for decomposing the original problem and merging the sub-solutions returned by the lower-level policy. As decomposition will inevitably degrade the quality of the final solution, we let the upper-level policy learn to generate a decomposition strategy in an adaptive and dynamic manner to alleviate this. On the other hand, once a sub-problem is identified, it is handed over to the lower-level policy, which solves it as an open-loop TSP. Its solution is then passed back to the upper-level policy to be merged into the existing partial route.

Algorithm 1: Hierarchical TSP Algorithm
Input: TSP instance V = {v_1, v_2, ..., v_N}, initial solution τ_init = {v_d}
Output: Solution route τ = {τ_1, τ_2, ..., τ_N}
1: τ ← τ_init ∪ {v}, where v is the node nearest to v_d;
2: while len(τ) < N do
3:   SubProb ← GenerateSubProb(V, τ);
4:   SubSol ← SolveSubProb(SubProb);
5:   τ ← MergeSubSol(SubSol, τ);
6: end
7: return τ

Upper-Level Model
As mentioned before, the upper-level model decomposes the large-scale TSP so that the resulting sub-problems can be solved efficiently without significantly degrading the quality of the final solution. To achieve this, we conduct decomposition in an adaptive and dynamic manner. This differs from the existing decomposition-based approach of (Fu, Qiu, and Zha 2021), where all sub-problems are pre-generated before being merged into the final solution. By interleaving decomposition and merging, the upper-level model can learn an adaptive policy that makes the best decision based on the current partial solution and the distribution of the remaining nodes.

A Scalable Encoder One of the key obstacles in solving the large-scale TSP with DRL is encoding a large number of edges in the graph. To obtain a scalable encoder, inspired by a technique used in 3D point cloud projection (Lang et al. 2019), we propose a Pixel Encoder that encodes the graph as pixels. The idea is to convert a point cloud, which in our case consists of the nodes of a TSP instance, into a pseudo-image.
As a first step, the 2D space is discretized into an evenly spaced H × W grid, creating a set of pixels. The nodes are then divided into clusters based on the grid cell they fall into. We augment the features of each node to a vector (x_a, y_a, Δx_g, Δy_g, Δx_c, Δy_c, x_pre, y_pre, x_nxt, y_nxt, m_select), where (x_a, y_a) is the absolute coordinate of the node, and (Δx_g, Δy_g) and (Δx_c, Δy_c) are its relative coordinates to the grid-cell center and the node-cluster center, respectively. If the node has been visited, (x_pre, y_pre, x_nxt, y_nxt) denote the coordinates of its neighbors on the partial route; otherwise they are 0. The boolean variable m_select indicates whether the node has been visited or not.
For a TSP instance with N nodes, we thus have a tensor of size (N, D), where D = 11 is the number of features. This tensor is processed by a linear layer to generate an (N, C)-sized high-dimensional tensor. According to the divided clusters, we apply a max operation to the C-dimensional features of the nodes in each cluster to obtain the feature of each grid cell, and we use zero padding for empty cells. The combination of all grid cells forms a pseudo-image: a tensor of size (H, W, C).
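As an illustration, the per-cell max pooling described above can be sketched as follows. This is only a minimal sketch under our own naming (pixel_encode and grid_idx are not taken from the authors' implementation), assuming the (N, C) node features and each node's flattened grid-cell index have already been computed:

```python
import torch

def pixel_encode(node_feats: torch.Tensor, grid_idx: torch.Tensor, H: int, W: int) -> torch.Tensor:
    """Max-pool per-node features into an (H, W, C) pseudo-image.

    node_feats: (N, C) node features after the per-node linear layer.
    grid_idx:   (N,)   flattened grid-cell index (row * W + col) of each node.
    Cells that contain no node are left as zeros (zero padding).
    """
    N, C = node_feats.shape
    image = torch.zeros(H * W, C)
    for cell in grid_idx.unique():
        mask = grid_idx == cell
        # feature of this grid cell: element-wise max over the nodes it contains
        image[cell] = node_feats[mask].max(dim=0).values
    return image.reshape(H, W, C)
```

In practice the per-cell maximum would be vectorized (e.g., with a scatter-max), but the explicit loop keeps the construction of the pseudo-image easy to follow.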
This pseudo-image can then be processed by a convolutional neural network (CNN), resulting in an embedding vector of the whole TSP instance for the DRL model.
The DRL model follows the actor-critic architecture: there is a policy head for the policy function and a value head for the state-value function. Both heads are composed of fully connected layers and activation functions.

Sub-Problem Generation and Merging The action space of our upper-level policy is continuous with two dimensions, which denote the coordinates of a point in the grid. We illustrate in this paragraph how a sub-problem is generated and merged given an upper-level policy; the procedure is depicted in Algorithm 2. For a given action Coord_pred, let v_c be the node closest to Coord_pred that has not been visited yet. We further let v_b be the node closest to v_c that has already been visited. Then we keep expanding the sub-problem by selecting unvisited nodes from the neighbors of v_b based on a k-Nearest Neighbor (k-NN) graph, until the size of the sub-problem reaches maxNum or all nodes have been visited or selected. Intuitively, the k-NN graph is implemented by associating each node with its k closest nodes, and the sub-problem expansion follows a breadth-first search on this graph. Finally, we enrich the sub-problem (SelectFragment) with a fragment of visited nodes centered at v_b, so that the resulting sub-problem contains no more than subLength nodes. In this way, we break the existing partial route to obtain a path with two endpoints, and after solving the sub-problem as an open-loop TSP, we obtain another path with two endpoints.

Algorithm 2: Sub-problem Generation
Input: k-NN graph G_kNN = (V, E); partial solution at step t, τ_t = (v_{t1}, v_{t2}, ...); length of the sub-problem subLength; maximum number of unvisited nodes maxNum; upper-level model UpperModel
Output: Sub-problem P = (v_{s1}, v_{s2}, ..., v_{s_subLength}); two endpoints v_s, v_t ∈ P
1: P ← ∅, S_v ← τ_t, Q_new ← Deque();
2: UpperModel takes G_kNN and τ_t as input and outputs Coord_pred;
3: v_c ← the unvisited node closest to Coord_pred;
4: v_b ← the visited node closest to v_c;
5: push v_b to the end of Q_new;
6: while len(P) ≤ maxNum and Q_new is not empty do
7:   v_i ← PopFront(Q_new);
8:   for v_j ∈ N_{G_kNN}(v_i) with v_j ∉ S_v do
9:     push v_j to the end of Q_new;
10:    add v_j to S_v and P;
11:  end
12: end
13: oldLength ← subLength − len(P);
14: P_t ← SelectFragment(τ_t, v_b, oldLength);
15: P ← P ∪ P_t;
16: v_s, v_t ← SetEndpoints(P_t);
17: return P, v_s, v_t

Markov Decision Process We can now formally introduce the underlying MDP of the upper-level policy. Let M_G = ⟨S, A, P, R, γ⟩ denote an MDP modelling a given TSP instance G = (V, E), where
• S is the set of all states, containing all possible path fragments τ of G;
• A = [0, 1] × [0, 1] is the set of all actions, containing all points in the unit grid;
• P : S × A → S is a deterministic transition function given both the upper-level and lower-level policies;
• R : S × A × S → ℝ is the reward function defined by R(τ, a, τ′) = L(τ′) − L(τ), where τ is the current partial route and τ′ is the previous partial route;
• γ is the discount factor, which is set to 1 in our experiments.

Lower-Level Model
The lower-level model is trained to solve the open-loop TSPs with fixed endpoints generated by the upper-level model. As lower-level policies are invoked many times during training and inference, their performance has a significant impact on the performance of our approach. Fortunately, there are many end-to-end approaches that can solve relatively small-scale TSPs effectively and efficiently (Kool, Van Hoof, and Welling 2019; Kwon et al. 2020). We adopt the main ideas of these approaches to devise an efficient lower-level policy, which we briefly describe in this section.

Neural Network The underlying neural network of our lower-level model is a Transformer, which has been widely used in natural language processing and computer vision in recent years. It consists of Multi-Head Attention and Multi-Layer Perceptron layers, with a mask mechanism to remove all invalid actions.
Our neural network follows the encoder-decoder structure, where the encoder uses self-attention layers to encode the input node sequence, while the decoder outputs a sequence of nodes in an auto-regressive manner. In the approaches presented in (Kool, Van Hoof, and Welling 2019; Kwon et al. 2020), the following context is used as input of the encoder:

q_{context} = q_{graph} + q_{first} + q_{last}   (2)

where q_graph, q_first, and q_last represent the feature vectors of the whole graph, the first node, and the last node of the current partial solution, respectively. While sufficient for TSP, this is inadequate for open-loop TSP, where we must keep in mind that there are two fixed endpoints. Therefore, we add two more vectors to encode the features of the two endpoints; that is, the input of our encoder is the context vector defined as follows:

q_{context} = q_{graph} + q_{first} + q_{last} + q_{source} + q_{target}   (3)
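To make Eq. (3) concrete, a minimal sketch is given below. The function name is ours, and the mean-pooled graph feature is an assumption for illustration (the text does not specify how q_graph is computed); the indices refer to precomputed node embeddings:

```python
import torch

def open_loop_context(node_emb: torch.Tensor, first: int, last: int,
                      source: int, target: int) -> torch.Tensor:
    """Context vector of Eq. (3) for an open-loop sub-problem.

    node_emb: (n, d) embeddings of the sub-problem nodes.
    first/last:    indices of the first/last node of the current partial solution.
    source/target: indices of the two fixed endpoints.
    """
    q_graph = node_emb.mean(dim=0)  # assumed: whole-graph feature via mean pooling
    return (q_graph + node_emb[first] + node_emb[last]
            + node_emb[source] + node_emb[target])
```

Compared with the closed-loop context of Eq. (2), only the two endpoint terms are new.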
The POMO approach introduced in (Kwon et al. 2020) takes advantage of the symmetry property of TSP, which improves its performance considerably. Although open-loop TSPs do not have the same symmetry property because of the fixed endpoints, we can achieve such a symmetry easily as follows: during node selection, all nodes except the endpoints are treated as in an ordinary TSP, without any constraint; whenever one endpoint is chosen, we let the other one be chosen automatically. The final solution of the original open-loop TSP is obtained by removing the redundant edge between the two endpoints.

Markov Decision Process The underlying MDP of the lower-level policy can be defined similarly to (Kool, Van Hoof, and Welling 2019; Kwon et al. 2020), where
• States: the states contain all possible contexts defined as in Eq. (3);
• Actions: the actions contain all nodes of the TSP, with dynamic masks to remove nodes that have already been visited;
• Rewards: a reward equal to the negative cost of the route is assigned whenever a state corresponding to a feasible solution is encountered; otherwise the reward is 0.

Training
The proposed framework is trained by a hierarchical DRL algorithm. More specifically, the models on the two levels are trained jointly with DRL.

Upper-Level Model
The upper-level model is trained with the well-known Proximal Policy Optimization (PPO) algorithm (Schulman et al. 2017), one of the SOTA DRL algorithms based on the actor-critic architecture. It learns a stochastic policy by minimizing the following clipped objective function:

L(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right)\hat{A}_t\right)\right]   (4)

where r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{old}}(a_t \mid s_t) denotes the probability ratio of the two policies, \hat{A}_t denotes the advantage function, and \epsilon is a hyperparameter controlling the clipping range. The advantage function represents the advantage of the current policy over the old policy; here we use the Generalized Advantage Estimator (GAE) to compute it:

\hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l \left( r_{t+l} + \gamma \hat{V}(s_{t+l+1}) - \hat{V}(s_{t+l}) \right)   (5)

where r_t is the reward at time t, \hat{V} is the state-value function, \gamma is the discount factor, and \lambda is the hyper-parameter that controls the compromise between bias and variance of the estimated advantage.
Besides the policy loss, we also add a value loss and an entropy loss:

L_{\hat{V}}(\theta) = \hat{\mathbb{E}}_t\left[ \left( \hat{V} - \hat{V}_\theta \right)^2 \right]   (6)

L_E(\theta) = \hat{\mathbb{E}}_t\left[ \pi_\theta(a \mid s) \log \pi_\theta(a \mid s) \right]   (7)

The total loss of the upper-level model is:

L_{UPPER}(\theta) = \lambda_p L(\theta) + \lambda_v L_{\hat{V}}(\theta) - \lambda_e L_E(\theta)   (8)

where \lambda_v is the weight of the value loss and \lambda_e is the weight of the entropy loss, which balances the policy's exploration and exploitation.

Lower-Level Model
The lower-level model is an end-to-end model for solving open-loop TSPs with a relatively small number of nodes. It is trained by the classic REINFORCE (Williams 1992) algorithm with a shared baseline, as in (Kool, Van Hoof, and Welling 2019; Kwon et al. 2020). The REINFORCE algorithm collects experience by Monte Carlo sampling, and the policy gradient is computed as follows:

\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[ \nabla_\theta \log \pi_\theta(\tau \mid s)\, A^{\pi_\theta}(\tau) \right] \approx \frac{1}{N} \sum_{i=1}^{N} \left( R(\tau^i) - b(s) \right) \nabla_\theta \log \pi_\theta(\tau^i \mid s)   (9)

where \tau denotes a trajectory, namely a feasible solution of a TSP instance. The reward R(\tau^i) = -L(\tau^i) is defined as the negative cost of \tau^i. The shared baseline b(s) is used to reduce the variance and improve training stability; it is obtained by averaging the returns of a set of trajectories generated from the same instance:

b(s) = \frac{1}{N} \sum_{i=1}^{N} R(\tau^i)   (10)

Joint Training
In order to improve the performance of the upper-level and lower-level models, we adopt a joint training strategy. Specifically, the current lower-level policy is used to collect trajectories for training the upper-level model, and meanwhile the sub-problems generated by the upper-level policy are in turn stored to train the lower-level model. Through this interleaved training procedure, the policies on the two levels receive instant feedback from each other, which makes learning a cooperative policy possible.
As mentioned before, the solution quality of the lower-level policy has a significant impact on the final solution. If we start from a random lower-level policy, the upper-level policy would receive a lot of misleading feedback, making its training hard to converge. To alleviate this, we introduce a warm-up stage for the lower-level model, pre-training it with sub-problems randomly generated from the original TSP. According to our experiments, such a warm-up stage accelerates convergence and makes training more stable.
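The interleaved procedure described above can be summarized by the following schematic loop. This is only a sketch: rollout, ppo_update, and reinforce_update are placeholder callables standing for the H-TSP rollout and the updates of Eqs. (4)-(8) and (9)-(10), not functions from the authors' implementation, and the lower-level policy is assumed to be warmed up already.

```python
def joint_training(upper, lower, instances, epochs,
                   rollout, ppo_update, reinforce_update):
    """Schematic joint training of the upper- and lower-level policies.

    rollout(upper, lower, tsp) -> (trajectory, subproblems): runs H-TSP on one
        instance with the *current* lower-level policy, recording both the
        upper-level trajectory and the generated open-loop sub-problems.
    ppo_update(upper, trajectory): PPO update of the upper-level model.
    reinforce_update(lower, subproblems): REINFORCE update with shared baseline.
    """
    for _ in range(epochs):
        subproblem_buffer = []
        for tsp in instances:
            trajectory, subproblems = rollout(upper, lower, tsp)
            subproblem_buffer.extend(subproblems)   # stored to train the lower level
            ppo_update(upper, trajectory)           # upper level gets instant feedback
        reinforce_update(lower, subproblem_buffer)  # lower level trained on fresh sub-problems
    return upper, lower
```

The key point is that each level is updated on data produced while the other level's current policy is in the loop, which is what allows the two policies to adapt to each other.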
Experiments
To demonstrate how our approach works on the large-scale TSP, we adopt four datasets to evaluate it. The four datasets contain TSP instances with problem sizes of 1000, 2000, 5000, and 10000 nodes, denoted as Random1000, Random2000, Random5000, and Random10000, respectively. To make the experiment results comparable, Random1000 and Random10000 contain the same instances used by Fu et al. in their work (Fu, Qiu, and Zha 2021), while the instances in Random2000 and Random5000 are generated with nodes uniformly distributed in the unit square, in line with existing approaches. Each dataset contains 16 TSP instances, except Random1000, which contains 128 instances. All experiment results were obtained on a machine with an NVIDIA Tesla V100 (16GB) GPU and an Intel Xeon Platinum CPU.

Hyper-Parameters Setting The upper-level model consists of a pixel encoder and a DRL agent model. We use a 3-layer CNN for the pixel encoder with 16, 32, and 32 channels, respectively, and it outputs a 128-dimensional feature vector. Our DRL model with the actor-critic architecture consists of an actor network and a critic network, each of which is a 4-layer MLP. The lower-level model follows the encoder-decoder structure, with a 12-layer self-attention encoder and a 2-layer context-attention decoder. Most embedding dimensions in the neural networks are set to 128, except for the CNN layers, the first encoding layer, and the output layer. During training, we use the AdamW optimizer with a learning rate of 1e-4 and a weight decay of 1e-6. For the sub-problem generation stage, we set k = 40 for the k-nearest-neighbor graph, the sub-problem length to 200, and the maximum number of new nodes in a sub-problem to 190. The lower-level model is trained for 500 epochs in the warm-up stage, and the joint training stage takes 500, 1000, 1500, and 2000 epochs, respectively, for the different datasets. Our algorithm is implemented based on PyTorch (Paszke et al. 2019); the trained models and related data are publicly available.
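For reference, these settings can be collected into a single configuration object. The key names below are purely illustrative (they are not taken from the released artifacts); the values are the ones quoted in the text:

```python
# Hyper-parameters of H-TSP as reported above; key names are illustrative.
H_TSP_CONFIG = {
    "pixel_encoder_cnn_channels": (16, 32, 32),        # 3-layer CNN
    "instance_embedding_dim": 128,
    "actor_mlp_layers": 4,
    "critic_mlp_layers": 4,
    "lower_encoder_layers": 12,                        # self-attention encoder
    "lower_decoder_layers": 2,                         # context-attention decoder
    "embedding_dim": 128,                              # most layers
    "optimizer": "AdamW",
    "learning_rate": 1e-4,
    "weight_decay": 1e-6,
    "knn_k": 40,
    "sub_problem_length": 200,                         # subLength
    "max_new_nodes": 190,                              # maxNum
    "warmup_epochs": 500,
    "joint_training_epochs": (500, 1000, 1500, 2000),  # per dataset size
}
```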
Results on Random1000 and Random2000 (tour Length, optimality Gap in %, and Time in s):

Algorithm       Random1000                  Random2000
                Length   Gap(%)   Time(s)   Length   Gap(%)   Time(s)
Concorde        23.12    0.00     487.89    32.48    0.00     7949.97
LKH-3           23.16    0.17     22.01     32.64    0.49     79.75
OR-Tools        24.23    4.82     104.34    34.04    4.82     532.14
POMO            30.52    32.01    4.28      46.49    43.15    35.89
DRL-2opt        37.90    63.93    55.56     115.59   255.92   827.43
Att-GCN+MCTS    23.86    3.22     5.85      33.42    2.91     200.28
H-TSP           24.65    6.62     0.33      34.88    7.39     0.72

Results on Random5000 and Random10000 (tour Length, optimality Gap in %, and Time in s):

Algorithm       Random5000                  Random10000
                Length   Gap(%)   Time(s)   Length    Gap(%)    Time(s)
LKH-3           51.36    0.00     561.74    72.45     0.00      4746.59
OR-Tools        53.35    3.86     5368.24   74.95     3.44      21358.66
POMO            80.79    57.29    575.63    OOM       OOM       OOM
DRL-2opt        754.91   1369.76  2308.48   2860.86   3848.66   6073.43
Att-GCN+MCTS    52.83    2.86     377.47    74.93     3.42      395.85
H-TSP           55.01    7.10     1.66      77.75     7.32      3.32

Figure 2: Optimality gap of models with different upper-level and lower-level models (DRL vs. random upper level, DRL vs. heuristic lower level) on TSP1000-TSP10000.

To evaluate the generalization ability of H-TSP, we test the four trained models on randomly generated datasets with different numbers of nodes. Figure 1 shows that H-TSP generalizes well to TSP instances ranging from 1000 to 50000 nodes. Note that the optimality gap on Random50000 is smaller than the gap on Random20000, because the reference solutions of these datasets are generated by LKH-3, and the solution quality of LKH-3 also declines as the number of nodes increases.

Algorithm          Random5000            Random10000
                   Gap(%)    Time(s)     Gap(%)    Time(s)
H-TSP              7.10      1.66        7.32      3.32
H-TSP with LKH-3   5.10      15.12       5.57      27.94
Att-GCN+MCTS       2.86      377.47      3.42      395.85

Table 2: Comparison of H-TSP and its variant with LKH-3

Variant                 Gap (%)   Δ Gap (%)
H-TSP                   6.76      0.00
w/o visited fragment    7.42      +0.66
w/o k-NN                7.69      +0.93
w/o joint training      7.53      +0.77
w/o warm-up             27.05     +20.29
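In all tables, Gap denotes the optimality gap of a method relative to the best solution in the corresponding column (Concorde for Random1000/2000 and LKH-3 for Random5000/10000, both of which show a gap of 0.00); assuming the standard definition, it is computed as

\mathrm{Gap} = \frac{L_{\text{method}} - L_{\text{best}}}{L_{\text{best}}} \times 100\%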
References
Applegate, D. L.; Bixby, R. E.; Chvátal, V.; and Cook, W. J. 2007. The Traveling Salesman Problem: A Computational Study. Princeton University Press.
Applegate, D. L.; Bixby, R. E.; Chvátal, V.; Cook, W.; Espinoza, D. G.; Goycoolea, M.; and Helsgaun, K. 2009. Certification of an optimal TSP tour through 85,900 cities. Operations Research Letters, 37(1): 11–15.
Bello, I.; Pham, H.; Le, Q. V.; Norouzi, M.; and Bengio, S. 2017. Neural combinatorial optimization with reinforcement learning. In 5th International Conference on Learning Representations, ICLR 2017 – Workshop Track Proceedings.
Bengio, Y.; Lodi, A.; and Prouvost, A. 2021. Machine learning for combinatorial optimization: A methodological tour d'horizon. European Journal of Operational Research, 290(2): 405–421.
da Costa, P. R. d. O.; Rhuggenaath, J.; Zhang, Y.; and Akcay, A. 2020. Learning 2-Opt Heuristics for the Traveling Salesman Problem via Deep Reinforcement Learning. In Pan, S. J.; and Sugiyama, M., eds., Proceedings of the 12th Asian Conference on Machine Learning, volume 129 of Proceedings of Machine Learning Research, 465–480. PMLR.
Dai, H.; Khalil, E. B.; Zhang, Y.; Dilkina, B.; and Song, L. 2017. Learning combinatorial optimization algorithms over graphs. In Advances in Neural Information Processing Systems, 6349–6359.
Fu, Z.-H.; Qiu, K.-B.; and Zha, H. 2021. Generalize a Small Pre-trained Model to Arbitrarily Large TSP Instances. Proceedings of the AAAI Conference on Artificial Intelligence, 35(8): 7474–7482.
Ghiani, G.; Guerriero, F.; Laporte, G.; and Musmanno, R. 2003. Real-time vehicle routing: Solution concepts, algorithms and parallel computing strategies. European Journal of Operational Research, 151(1): 1–11.
Guo, T.; Han, C.; Tang, S.; and Ding, M. 2019. Solving Combinatorial Problems with Machine Learning Methods. Springer Optimization and Its Applications, 147: 207–229.
Helsgaun, K. 2017. An Extension of the Lin-Kernighan-Helsgaun TSP Solver for Constrained Traveling Salesman and Vehicle Routing Problems. Roskilde: Roskilde University.
Kool, W.; Van Hoof, H.; and Welling, M. 2019. Attention, learn to solve routing problems! In 7th International Conference on Learning Representations, ICLR 2019.
Kwon, Y. D.; Choo, J.; Kim, B.; Yoon, I.; Gwon, Y.; and Min, S. 2020. POMO: Policy optimization with multiple optima for reinforcement learning. In Advances in Neural Information Processing Systems, volume 33.
Lang, A. H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; and Beijbom, O. 2019. PointPillars: Fast Encoders for Object Detection From Point Clouds. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 12689–12697.
Ma, Y.; Hao, X.; Hao, J.; Lu, J.; Liu, X.; Xialiang, T.; Yuan, M.; Li, Z.; Tang, J.; and Meng, Z. 2021. A Hierarchical Reinforcement Learning Based Optimization Framework for Large-Scale Dynamic Pickup and Delivery Problems. In Ranzato, M.; Beygelzimer, A.; Dauphin, Y.; Liang, P.; and Vaughan, J. W., eds., Advances in Neural Information Processing Systems, volume 34, 23609–23620. Curran Associates, Inc.
Mariescu-Istodor, R.; and Fränti, P. 2021. Solving the Large-Scale TSP Problem in 1 h: Santa Claus Challenge 2020. Frontiers in Robotics and AI, 8: 689908.
Nowak, A.; Villar, S.; Bandeira, A. S.; and Bruna, J. 2017. A Note on Learning Algorithms for Quadratic Assignment with Graph Neural Networks. arXiv:1706.07450.
Padberg, M.; and Rinaldi, G. 1991. A Branch-and-Cut Algorithm for the Resolution of Large-Scale Symmetric Traveling Salesman Problems. SIAM Review, 33(1): 60–100.
Papadimitriou, C. H. 1977. The Euclidean travelling salesman problem is NP-complete. Theoretical Computer Science, 4: 237–244.
Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; Desmaison, A.; Kopf, A.; Yang, E.; DeVito, Z.; Raison, M.; Tejani, A.; Chilamkurthy, S.; Steiner, B.; Fang, L.; Bai, J.; and Chintala, S. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Wallach, H.; Larochelle, H.; Beygelzimer, A.; d'Alché-Buc, F.; Fox, E.; and Garnett, R., eds., Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
Rosenkrantz, D. J.; Stearns, R. E.; and Lewis, P. M. 1974. Approximate algorithms for the traveling salesperson problem. In 15th Annual Symposium on Switching and Automata Theory (SWAT 1974), 33–42. IEEE.
Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal policy optimization algorithms. arXiv:1707.06347.
Taillard, É. D.; and Helsgaun, K. 2019. POPMUSIC for the travelling salesman problem. European Journal of Operational Research, 272(2): 420–429.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5999–6009.
Vinyals, O.; Fortunato, M.; and Jaitly, N. 2015. Pointer Networks. In Cortes, C.; Lawrence, N.; Lee, D.; Sugiyama, M.; and Garnett, R., eds., Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc.
Williams, R. J. 1992. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8(3): 229–256.
Xu, Z.; Li, Z.; Guan, Q.; Zhang, D.; Li, Q.; Nan, J.; Liu, C.; Bian, W.; and Ye, J. 2018. Large-scale order dispatch in on-demand ride-hailing platforms: A learning and planning approach. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 905–913.
Zheng, J.; He, K.; Zhou, J.; Jin, Y.; and Li, C.-M. 2021. Combining Reinforcement Learning with Lin-Kernighan-Helsgaun Algorithm for the Traveling Salesman Problem. Proceedings of the AAAI Conference on Artificial Intelligence, 35(14): 12445–12452.