Intelligent Scheduling of Discrete Automated Production Line Via Deep Reinforcement Learning



International Journal of Production Research, 2020
https://doi.org/10.1080/00207543.2020.1717008

Intelligent scheduling of discrete automated production line via deep reinforcement learning
Daming Shia∗ , Wenhui Fana , Yingying Xiaob , Tingyu Linb and Chi Xingb
a Department of Automation, Tsinghua University, Beijing, People’s Republic of China; b State Key Laboratory of Intelligent
Manufacturing System Technology, Beijing Institute of Electronic System Engineering, Beijing, People’s Republic of China
(Received 6 April 2019; accepted 9 January 2020)

Reinforcement learning (RL) is being used for scheduling to improve the adaptability and flexibility of automated production lines. However, existing methods assume that processing times are deterministic and known, and they ignore production line layouts and transfer units such as robots. This paper introduces deep RL to schedule an automated production line, avoiding manually extracted features and overcoming the lack of structured data sets. Firstly, we present a state modelling method for discrete automated production lines that is suitable for linear, parallel and re-entrant production lines with multiple processing units. Secondly, we propose an intelligent scheduling algorithm based on deep RL for scheduling automated production lines. The algorithm establishes a discrete-event simulation environment for deep RL, resolving the conflict between advancing the simulation time by the transferring time and responding to the most recent event. Finally, we apply the intelligent scheduling algorithm to scheduling linear, parallel and re-entrant automated production lines. The experiments show that our scheduling strategy achieves performance competitive with heuristic scheduling methods and maintains stable convergence and robustness under processing time randomness.
Keywords: intelligent scheduling; deep reinforcement learning; discrete-event simulation

1. Introduction
Discrete automated production lines are designed to produce complex products: transfer units grasp and transfer products, while processing units process them. The automated transfer of raw material input and semi-finished output raises the issue of allocating tasks to the processing units and transfer units simultaneously in the schedule. Most existing research assumes that processing times are deterministic and proves that single-transfer-unit scheduling is an NP-hard problem. Typical methods include mathematical programming, Gantt charts, Petri nets and branch-and-bound (Li, Lu, and Zang 2016).
Discrete automated production lines consist of multiple processing, assembling and detecting machines and a multi-degree-of-freedom robot that transfers products. Each piece of equipment performs complex assembly or position-detection operations, so its actual processing time is stochastic. Therefore, the scheduling system is unaware of the exact processing time of each process, and offline global scheduling is not feasible.
In this paper, we present an online RL-based scheduling method and verify that the learned strategy is robust to stochastic processing times. The scheduling method is based on deep reinforcement learning (RL) and is efficient and stable for scheduling discrete automated production lines.
Empirically, both supervised and unsupervised learning in machine learning (ML) need data sets for training. However, it is difficult and prohibitively expensive to label a sufficient scheduling data set manually. RL, in contrast, enables a control system to optimise its policy through exploration, with no initial data or policy given. Deep RL has mostly been used for fixed time-step tasks such as Atari and Go, whereas scheduling a discrete automated production line is a typical discrete-event task. Thus, to enable the scheduling system (the agent in RL) to learn an efficient policy, we build an accurate and effective simulation environment for RL.
The remainder of the paper is organised as follows. Section 2 introduces related works. Section 3 presents the deep RL-based intelligent scheduling method. Section 4 conducts experiments and analyses the scheduling of linear, parallel and re-entrant discrete automated production lines based on deep RL. The fifth section concludes and outlines our future work.

*Corresponding author. Email: [email protected]

© 2020 Informa UK Limited, trading as Taylor & Francis Group



2. Related works
Many researchers have resorted to RL for scheduling policy research. Zhang and Dietterich (1995, 1996) first apply RL to job-shop scheduling, using temporal-difference TD(λ) learning for NASA space shuttle payload processing and outperforming the iterative repair method of Zweben et al. (1993). Riedmiller and Riedmiller (1999) use RL to learn local job-shop dispatching policies on single-resource and three-resource lines, outperforming heuristic scheduling. Paternina-Arboleda and Das (2005) study a single server with multiple products to obtain a dynamic control policy with RL, yet the state-action space grows to 1.8 × 10^7. Gabel and Riedmiller (2007) propose a partially observable job-shop scheduling policy with RL. Zhang et al. (2012) model the n-task, m-machine scheduling problem as an RL problem that chooses among four heuristic methods, obtaining a good online scheduling system. Palombarini and Martínez (2012) propose the automated generation and update of rescheduling knowledge through learning. Stricker et al. (2018) use Q-learning to build an online scheduling system that considers the influence of production complexity and processing details. Waschneck et al. (2018a, 2018b) use deep RL for multi-agent scheduling. Shiue, Lee, and Su (2018) use Q-learning to design multi-rule online scheduling based on multiple dispatching rules (MDR).
Moreover, researchers have brought RL to bear on the randomness of production and scheduling. Li, Wang, and Sawhney (2012) design a Q-learning method to decide whether to accept or refuse tasks over an infinite planning horizon with stochastic demands. Shin et al. (2012) apply RL for autonomous goal formation. Lin and Yao (2015) utilise a multi-state Q-learning approach for load balancing. Shahrabi, Adibi, and Mahootchi (2017) consider the randomness of task arrival and machine breakdown with Q-factor RL. Kara and Dogan (2018) deal with inventory management under random demand and deterministic lead time using RL.
Generally, there are two weaknesses in current works:
Firstly, current discrete automated production line scheduling methods focus on processing units and products rather than on the time consumed by transferring. Stricker et al. (2018) and Kuhnle, Röhrig, and Lanza (2019) consider specific layouts and study transport resources as their main focus. However, most intelligent scheduling algorithms still restrict their scope to processing units. Therefore, considering transferring motions in a discrete automated production line is significant.
Secondly, current adaptive or intelligent scheduling mainly focuses on machine breakdown and task arrival rather than on the randomness of processing time. Although Waschneck et al. (2018a, 2018b) consider stochastic processing times, research on robustness to stochastic processing times is still scarce. Therefore, it is necessary to study how the randomness of processing time influences the scheduling policy.
In contrast, on the one hand, our proposed RL-based scheduling focuses on scheduling the transfer unit rather than the processing units. On the other hand, the RL-based scheduling algorithm combines discrete-event simulation (DES) and deep RL, and we verify its robustness to processing time randomness.

3. Deep RL-based scheduling


The scope of scheduling in this paper is a discrete automated production line whose transportation depends on a mechanical robot. We assume the production line has no buffers. It consists of n different processing units and one transfer unit. The transfer unit (a mechanical robot) can move freely around the processing units and grab or place products. The main scheduling task is to decide how the transfer unit transfers products.
Whenever a product is transferred into the next processing unit, a stochastic processing time is generated; the scheduling system does not know when, or which, process will finish. Yet whenever a process finishes, the scheduling system observes the transition of the production line state and may choose to respond by transferring the product into the next processing unit. In all, the processing time of each process is stochastic, and the scheduling system and the transfer unit are unaware of when any process will finish.
The intelligent scheduling based on deep RL in this paper consists of a deep RL agent and a DES environment. As shown in Figure 1, the agent observes the current environment state sk, the state vector of the environment at the kth decision. The agent chooses a feasible action ak ∈ Ak = {ai | p(sk, ai) ≠ 0} according to the current state sk and the policy π. This action ak then causes the environment state to transition to sk+1 ∈ Sk+1 = {sj | p(sj | s = sk, a = ak) ≠ 0}, i.e. the environment moves to one of the probable successor states. For the discrete automated production line considered here the transition is deterministic, so the new state is exclusively sk+1 = s(sk, ak). Meanwhile, the environment returns the kth decision reward rk to indicate whether the action is encouraged or punished. In this interaction loop, the agent observes the state, explores and estimates actions, and thereby improves its policy π so as to maximise the accumulated reward. In this section, we clarify the principle of intelligent scheduling based on deep RL and the simulation algorithm, respectively.

3.1. Intelligent scheduling based on deep RL


Deep RL is an RL algorithm proposed by Google DeepMind. The principle of RL is inspired by the observation that highly intelligent animals can learn from exploration and gain experience from practice when faced with unknown objects.

Figure 1. Interaction between deep RL agent and simulation environment.

However, DeepMind noted that the applicability of classical RL is limited to domains where features are easily extracted or the state is low-dimensional and observable. Deep RL trains neural networks so that the agent can perceive high-dimensional input information, achieving end-to-end RL (Mnih et al. 2015). In this paper, we focus on scheduling a discrete automated production line, a task for which it is difficult to handcraft feature functions. Moreover, the state dimension can be prohibitively high when there are many processing resources and product demands. Therefore, we consider deep RL a proper and capable machine learning method for scheduling discrete automated production lines.
The following sections introduce the state modelling for processing units, the action modelling for transfer units, the
modelling for production line environment reward and the iteration of scheduling based on deep RL.

3.1.1. State modelling for processing unit


The state is the vector that captures the condition of the production line; it is the input the agent receives when observing the environment. The agent chooses an action according to the current state and policy, which moves the environment into a new state. In this paper, the scheduling system (the RL agent) can observe the whole production line, so the state should contain nearly all information about it. In this section, we give three state models for discrete automated production lines:
a. busy or idle of every processing unit
In this scenario, we assume all processing units are inhomogeneous and the process flow is not re-entrant. A processing unit is idle when it holds no product, busy when it is processing, and finished when it has finished processing; then we have
si ∈ {idle, busy, finished}, 1 ≤ i ≤ m (1)
Such an m-dimensional vector denotes the state of m processing machines. The merit of this state model is that scheduling need not keep track of which process each product is in or how many WIPs there are. The transfer unit serves the processing machines by moving every product from a finished machine into the next processing machine.
Specifically, this state model requires a non-re-entrant process flow, otherwise the Markov property is violated. In the non-re-entrant case, the scheduling action depends only on the current state rather than on how that state was reached. For example, the feasible actions for state (finished, idle, finished) are transferring 1 (finished) to 2 (idle) or transferring 3 (finished) out, irrespective of whether the previous state was (busy, idle, finished) or (finished, idle, busy). Apparently, the non-re-entrant process follows the Markov property. For a re-entrant process, however, the feasible actions depend not only on the current state but also on how the state was reached; thus the re-entrant process violates the Markov property.
b. busy or idle of every process in the process flow
In this scenario, we assume the process flow can be parallel or re-entrant. A process is idle when it is neither processing nor occupied, busy when it is being processed, and finished when processing has finished but the process is still occupied; then we have
sj ∈ {idle, busy, finished}, 1 ≤ j ≤ p. (2)
Such a p-dimensional vector denotes the state of a process flow with p processes. The merit of this state model is that scenario b allows re-entrant processing units in production.

c. state of multiple processing resources

Moreover, if there are multiple homogeneous processing machines, the augmented state records the quantities of idle, busy and finished machines of each class:
st ∈ {idle[t], busy[t], finished[t]}, 1 ≤ t ≤ m, (3)
where idle[t] + busy[t] + finished[t] = num[t], 1 ≤ t ≤ m. Here t denotes the tth class of processing units, with m classes in total; idle[t] denotes the quantity of idle units of class t, busy[t] the quantity of busy units, and finished[t] the quantity of finished units, whose sum num[t] is constant. Then (st), or (sj, st), can represent the state for scheduling multiple processing units in scenario a or scenario b respectively.
In particular, the discrete automated production lines in this paper are processing dominant, i.e. the processing time is much larger than the transferring time, so the state vector ignores the location of the transfer unit. Transferring time is nevertheless accounted for in the simulation environment.
Each of the three state models has its own merit for certain scenarios. State model a suits production lines whose processing machines are exclusively functional: even if more than one type of product is scheduled, the state of the machines is what matters. State model b is better suited to complicated processing sequences, because each processing step is controllable. State model c can deal with scenarios where there is more than one machine of a given type. In all, the choice of state model depends on which variables can be controlled most simply.
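To make the three state models concrete, the following is a minimal Python sketch of one possible encoding; the Status enum and the function names are illustrative choices, not notation from the paper.

from enum import IntEnum

class Status(IntEnum):
    IDLE = 0
    BUSY = 1
    FINISHED = 2

# Scenario a: one status per processing unit (m units).
def state_a(unit_status):            # e.g. [Status.FINISHED, Status.IDLE, Status.BUSY]
    return tuple(unit_status)

# Scenario b: one status per process in the process flow (p processes).
def state_b(process_status):
    return tuple(process_status)

# Scenario c: for each machine class t, the counts (idle[t], busy[t], finished[t]),
# which always sum to the constant num[t].
def state_c(counts_per_class):       # e.g. [(2, 1, 0), (0, 3, 1)] for two machine classes
    return tuple(counts_per_class)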

3.1.2. Action modelling of transfer unit


Action is the output of RL. The agent (transfer unit) observes the environment state sk and chooses the action of highest value, ak = argmax_a Q(a | sk, π). The value function Q(a|s), which represents the policy, is continually learned and optimised. To let the agent fully explore the combinations of states and actions, in the initial episodes RL chooses actions randomly more often than according to the value function. As the learning episodes increase, the probability of consulting the trained value function Q(a|s) also increases. In the final learning episodes the agent schedules almost entirely from the trained value function, so that we can inspect the scheduling policy. Numerically, the parameter ε denotes the probability of choosing an action at random, and ε decreases from 1 to nearly 0 as the learning episodes proceed; technically, deep RL decreases ε as a linear function of the episode.
The action set consists of all actions that transfer a finished product into its post-process (or post-processing unit). In particular, transferring raw material into the production line and transferring the last finished product out of the line are also logical actions. However, not all logical actions are always feasible: a feasible action requires that the leading process (or processing unit) is finished and the post-process (or processing unit) is idle. The feasibility of actions, which is tied to the reward the agent receives, is also part of the policy to be learned.
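As an illustrative sketch of ε-greedy selection with a linearly decaying ε as described above (the decay endpoints and the q_values_fn interface are assumptions, not the paper's implementation):

import random
import numpy as np

def epsilon(episode, total_episodes, eps_start=1.0, eps_end=0.01):
    """Linearly decay the exploration probability from eps_start to eps_end."""
    frac = min(episode / max(total_episodes - 1, 1), 1.0)
    return eps_start + frac * (eps_end - eps_start)

def choose_action(q_values_fn, state, episode, total_episodes, num_actions):
    """epsilon-greedy selection: explore at random, otherwise take argmax_a Q(a | s)."""
    if random.random() < epsilon(episode, total_episodes):
        return random.randrange(num_actions)
    q_values = np.asarray(q_values_fn(state))      # one value per logical action
    return int(np.argmax(q_values))

# Example with a dummy value function over 4 logical actions:
print(choose_action(lambda s: [0.1, 0.5, -0.2, 0.0], state=(1, 0, 2),
                    episode=900, total_episodes=1000, num_actions=4))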

3.1.3. Reward modelling for production line environment


Reward is the short-term return for each scheduling step. When the agent chooses ak in state sk (whether randomly or based on the policy), the environment returns a reward r. Although we do not provide a training set for learning, we express our preferences and ultimate goal through the reward. Rewards for the different conditions are listed in Table 1.
Feasible actions are encouraged. A feasible action means the transfer unit serves a finished process or processing unit whose post-process is idle. In particular, the production line's raw material supply is infinite, so it is a feasible action whenever the first process is idle. Feasible actions change the environment state, and their reward should be positive.
Waiting can drive production. Wait means not to transfer at step k. Sometimes, when processing units are busy, there may be no feasible action. In this circumstance the agent is allowed to wait until some process finishes, which is clarified in the simulation section. Despite its positive effect, we give waiting a tiny punishment to keep the agent from idling indefinitely. Moreover, waits can also be split according to whether they drive production: useful waits are encouraged while useless waits are punished, which helps convergence and optimisation speed.
Infeasible actions are punished. An infeasible action is an illegal one, such as serving a process that is not finished, or serving a process whose post-process is not idle. Infeasible actions should be strictly punished, so their reward is negative with a large absolute value.

Table 1. Reward table for different action properties.


Action property Feasible action Wait Infeasible action Complete a product
Reward property Encouragement Tiny punishment Huge punishment High encouragement

Completing a product is the goal. When the agent completes a product, we give a large encouragement. This bonus indicates that we prefer the agent to complete more products rather than waste time accumulating small rewards.
In every RL episode the number of scheduling steps is finite; the reward the agent accumulates within these steps reflects the behaviour of its policy. Generally, the reward setting determines both what policy RL will learn and how it converges.
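A reward function consistent with Table 1 could be written as follows; the numeric values are placeholders chosen only to respect the ordering in Table 1 (the values actually used for the parallel contrast experiment appear in Table A1).

# Illustrative reward values; only their relative ordering mirrors Table 1.
R_FEASIBLE = 10          # feasible transfer: encouragement
R_WAIT = -1              # wait: tiny punishment to discourage idling
R_INFEASIBLE = -100      # infeasible action: huge punishment
R_COMPLETE = 200         # completing a product: high encouragement

def reward(action_is_feasible, action_is_wait, completes_product):
    """Return the step reward according to the action property."""
    if completes_product:
        return R_COMPLETE
    if action_is_wait:
        return R_WAIT
    return R_FEASIBLE if action_is_feasible else R_INFEASIBLE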

3.1.4. Scheduling policy modelling


The value function Q(ak|sk) denotes the comprehensive return of choosing action ak in state sk. Since the Q-value function is defined over S × A, it has traditionally been stored as a state-action table in the literature. However, because the dimensions of the state and action spaces may be prohibitively large for different production lines, we use an artificial neural network (ANN) to fit the value function. The input of the ANN is the environment state, and the output is the value of each action in order. The final online scheduling action is ak = argmax_a Q(a | sk, π), which reflects the Markov property: the optimal transferring action depends only on the current state.
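As a sketch of such a value network, a fully connected model can be built with tf.keras; the width and depth below follow Table 6, but the activation and other details are assumptions rather than the authors' exact architecture.

import tensorflow as tf

def build_q_network(state_dim, num_actions, width=100, depth=4):
    """Fully connected Q-network: a state vector in, one Q-value per logical action out."""
    inputs = tf.keras.layers.Input(shape=(state_dim,))
    x = inputs
    for _ in range(depth):
        x = tf.keras.layers.Dense(width, activation="relu")(x)
    outputs = tf.keras.layers.Dense(num_actions)(x)      # linear outputs: Q(s, a) for each a
    return tf.keras.Model(inputs=inputs, outputs=outputs)

# Example: the 7-process parallel line of Section 4.2 with 9 logical actions (P1-P8 and WAIT).
q_net = build_q_network(state_dim=7, num_actions=9)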
The value function is calculated iteratively by the Bellman equation. In this paper, we use the Bellman equation with a decay factor, as follows:

Q(s, a) = max_π E[ r1 + γ r2 + γ² r3 + · · · | sk = s, ak = a, π ]    (4)

Q−(s, a, hk) = max_π E[ r(s, a) + γ Q−(s′, a′, hk) | sk = s, ak = a, π ]    (5)

where Q−(s, a, hk) is the value function during learning and hk denotes the Q-network coefficients at the kth iteration. The equation encodes the experience of choosing action ak in state sk and receiving reward r(s, a). Deep RL combines the value predicted by the current neural network with the reward from the environment to obtain a new state-value function, iterating until convergence (Bellman and Dreyfus 1962).

3.1.5. Scheduling algorithm learning iteration


Deep RL stores every piece of experience and samples randomly from the experience set to train the ANN. We denote the experience vector e = (s, a, r, s′, done): in one attempt the agent chooses action a in state s, obtains reward r and next state s′, and done indicates whether this step terminates the schedule. If the agent reaches the target quantity of completed products or runs out of scheduling steps, the episode ends. The experience set of deep RL, D = {(s, a, r, s′, done)}, contains experiences from all steps. Because the experiences sampled for training may come from long ago or from just now, deep RL learns off-policy.
The iteration of RL is similar to training a deep neural network. Training data are sampled as mini-batches from the experience set D, and the objective function to optimise is as follows:
Li(hi) = E_(s,a,r,s′)∼U(D) [ ( r + γ max_a′ Q−(s′, a′; hi−) − Q(s, a; hi) )² ]    (6)

where γ ∈ (0, 1] is the decay factor: values close to one make the agent farsighted, while values close to zero make it shortsighted. The decay factor passes future rewards and values back to the current decision, leading the agent to consider the overall outcome. hi is the network coefficient at iteration i, and hi− is the coefficient used when predicting the target value, which is updated periodically during training of the Q-network. The loss of the ANN converges by minimising the squared difference between the predicted target value r + γ max_a′ Q−(s′, a′; hi−) and the current estimate Q(s, a; hi) (Mnih et al. 2015; Szepesvári 2010).
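The following sketch illustrates experience replay and one gradient step on the loss of Equation (6) with a target network Q−. It assumes TensorFlow 2's eager API (the experiments in Section 4.4 used TensorFlow 1.9.0) and is a generic DQN-style update, not the authors' implementation.

import random
from collections import deque

import numpy as np
import tensorflow as tf

class ReplayBuffer:
    """Experience set D = {(s, a, r, s', done)} with uniform random sampling."""
    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s_next, done = map(np.array, zip(*batch))
        return s, a, r, s_next, done

def train_step(q_net, target_net, optimizer, buffer, batch_size=32, gamma=0.95):
    """One gradient step on the squared TD error of Equation (6)."""
    s, a, r, s_next, done = buffer.sample(batch_size)
    # Target value: r + gamma * max_a' Q^-(s', a'; h^-), with bootstrapping cut at terminal steps.
    next_q = target_net(s_next.astype(np.float32)).numpy().max(axis=1)
    target = (r + gamma * next_q * (1.0 - done.astype(np.float32))).astype(np.float32)
    with tf.GradientTape() as tape:
        q_all = q_net(s.astype(np.float32))                       # Q(s, a; h) for every action
        q_sa = tf.reduce_sum(q_all * tf.one_hot(a, q_all.shape[-1]), axis=1)
        loss = tf.reduce_mean(tf.square(target - q_sa))
    grads = tape.gradient(loss, q_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, q_net.trainable_variables))
    return float(loss)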

3.2. DES environment and deep RL-based scheduling simulation algorithm


The environment for intelligent scheduling based on deep RL is a DES environment. In contrast with DES, fixed-step simulation advances time in constant small increments. However, when event times are continuous (e.g. random values), two events may fall into the same step and the simulation loses their sequential order, while shrinking the step to increase resolution wastes computational resources. Therefore, the environment for intelligent scheduling in this paper uses DES. The agent decides only when the state changes, at scheduling step k, rather than at every time t as in continuous tasks. This section clarifies the intelligent scheduling environment and the simulation algorithm.
The next subsection introduces the events and event set of DES, and the one after clarifies the intelligent scheduling simulation algorithm.

3.2.1. Processing event set and DES


The processing event set Event = {event} is the set of events that will happen in the future. An event element is event = (tnext, iorj), where tnext denotes the time at which the event will happen and iorj indicates that the ith processing unit finishes (scenario a state) or the jth process finishes (scenario b state). An event may be null if tnext = tsim and the second component equals zero; a null event does not advance the simulation time. Here tsim denotes the current simulation time. Note that only feasible actions add events to the event set, while infeasible actions do not actually happen; see Section 3.2.2 for details.
When a semi-finished product is transferred into processing unit i (or process j), a random value tprocess is drawn from the distribution of that process, and a new event eventnew = (tnext = tsim + tprocess, iorj) is generated. This means the system (specifically the event set) knows this process will finish at tnext, which will then change the environment state. The agent, however, does not know when the state transition happens, which is what requires online scheduling to be robust to stochastic processing times.
If the agent chooses a feasible action a, the environment also adds a null event eventnull = (tsim, 0), which becomes the very next event to happen without affecting the environment. The null event gives the agent a chance to choose another feasible action immediately after the previous one, and to keep doing so until no longer necessary, which ensures that parallel processes are served at approximately the same time. Obviously, no null event is added if the agent chooses to wait, because having chosen to wait means the agent could not choose a feasible action.
The event set is usually maintained in ascending order of event time for convenience in finding the next event to happen. Based on the event set of DES, we discuss the simulation algorithm in the next section.
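A future event set kept in ascending time order can be implemented with a binary heap, for example; the class and field names below are illustrative.

import heapq

class EventSet:
    """Future event list ordered by event time t_next."""
    def __init__(self):
        self._heap = []
        self._counter = 0          # tie-breaker so equal-time events keep insertion order

    def push(self, t_next, unit_or_process):
        # unit_or_process = 0 encodes a null event that does not change the state.
        heapq.heappush(self._heap, (t_next, self._counter, unit_or_process))
        self._counter += 1

    def pop(self):
        t_next, _, unit_or_process = heapq.heappop(self._heap)
        return t_next, unit_or_process

# Example: a process finishing at t_sim + t_process, plus a null event at t_sim = 100.
events = EventSet()
events.push(t_next=125.0, unit_or_process=2)   # process 2 finishes at t = 125
events.push(t_next=100.0, unit_or_process=0)   # null event at the current time
print(events.pop())                            # -> (100.0, 0): the null event comes first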

3.2.2. Simulation algorithm for deep RL-based scheduling


In this section, we discuss the interactive simulation algorithm between the DES and the deep RL scheduling agent (transfer unit) and clarify how the simulation time is driven. Deep RL agent learning iterates EPISODE times in total, with at most STEP scheduling actions each. We therefore describe only the k-step algorithm of scheduling learning within any training episode, where k ∈ (0, STEP], as shown in Table 2.
Firstly, the deep RL agent chooses the kth step action according to the value function and the state, ak = argmax_a Q(a, sk), or randomly in the early period. The environment returns the reward of this state-action pair, rk = r(sk, ak) (see the discussion in Sections 3.1.3 and 3.1.4).
Secondly, if the kth step action is feasible, the transfer generates a transferring time ttrans and a state change. The transferring time ttrans depends on the current location and the transferring action. The transferring time then forces the simulation time to increase, as verified below. The environment state changes to sk+1 = s(sk, ak), and a future event eventnew is generated.
Thirdly, if the kth step action is not wait, a null event eventnull is generated to give another chance to decide. Whether or not this action was feasible, the agent is encouraged to make another decision. Apparently, eventnull is the next event to happen and does not change the state.
Fourthly, the environment drives the simulation time according to DES. The next event is popped, changing the environment state, and the simulation time is advanced to the maximum of the new event time and the current simulation time. This alteration of traditional DES results from the transferring, which is discussed below.

Table 2. The k-step of intelligent scheduling based on deep RL simulation algorithm.

k-step of intelligent scheduling based on deep RL simulation algorithm (pseudo-code)
ak = argmax_a Q(a, sk)
rk = r(sk, ak)
If ak is FEASIBLE:
    ttrans = ttrans(ak)
    position = Position(position, ak)
    tsim = tsim + ttrans
    sk+1 = s(sk, ak)
    Push eventnew into Event
Endif
If ak is not WAIT:
    Push eventnull into Event
Endif
Pop eventnext from Event
sk+1 = s(eventnext)
tsim = max(tsim, tnext)
Push (ak, sk, rk, sk+1, done) to D
k = k + 1

Finally, the kth step experience (ak, sk, rk, sk+1, done) is stored for learning.
In the k-step algorithm, choosing a feasible action changes the environment state, and the null event gives the agent another chance to choose (even an infeasible action). Choosing wait drives production forward and therefore also changes the state. The simulation algorithm in this paper thus achieves the interaction between DES and deep RL decision-making. Moreover, the processing time is a random value.
As discussed above, if the agent chooses a feasible action, the transferring time forces the simulation time to increase, which may overshoot the next event, i.e. tsim > tnext. As shown in the algorithm, the next event is then responded to after transferring, i.e. tsim = max(tsim, tnext). The production meaning is that if some process finishes during a transferring task, the transfer is not interrupted; the system detains the finished product until the transfer unit is idle, which ensures that transferring time is accounted for without timing-sequence errors. We do not include the location of the transfer unit in the state, since the transferring time is far smaller than the processing time in a processing-dominant production line; nevertheless we do account for the transferring time in the simulation, which ensures the authenticity of the simulation.
In summary, the simulation algorithm for intelligent scheduling based on deep RL combines DES and deep RL while considering transferring time, which makes the intelligent scheduling simulation more authentic and complete.
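Translating the pseudo-code of Table 2 into Python, one scheduling step of the DES/RL interaction might look as follows. The environment and agent methods (observe_state, transfer_time, apply_transfer, state_after_event, remember, and so on) are assumed names used for illustration only, not an API defined in the paper.

def scheduling_step(env, agent):
    """One k-step of the DES / deep RL interaction sketched in Table 2 (assumed env/agent API)."""
    s_k = env.observe_state()
    a_k = agent.choose_action(s_k)                        # epsilon-greedy over Q(a | s_k)
    r_k = env.reward(s_k, a_k)

    if env.is_feasible(s_k, a_k):
        t_trans = env.transfer_time(env.position, a_k)    # depends on robot location and action
        env.position = env.new_position(a_k)
        env.t_sim += t_trans                              # transferring forces time forward
        env.apply_transfer(a_k)                           # state change caused by the transfer
        t_process = env.sample_processing_time(a_k)       # stochastic, unknown to the agent
        env.events.push(env.t_sim + t_process, env.target_process(a_k))

    if not env.is_wait(a_k):
        env.events.push(env.t_sim, 0)                     # null event: another chance to decide

    t_next, finished = env.events.pop()                   # next happening event
    s_next = env.state_after_event(finished)
    env.t_sim = max(env.t_sim, t_next)                    # detain products finished mid-transfer
    done = env.is_done()

    agent.remember(s_k, a_k, r_k, s_next, done)           # store the experience for replay
    return s_next, done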

4. Single-product constant processing scheduling experiment


In this section, we perform experiments of intelligent scheduling based on deep RL on single-product production lines. Sections 4.1-4.3 schedule a linear production line with the state of processing units (state scenario a), and a parallel production line and a re-entrant production line with the state of processes (state scenario b), respectively. Deep RL learning curves are drawn to verify convergence, and scheduling Gantt diagrams are shown to validate scheduling efficiency. During the experiments, randomness of processing time is introduced to verify robustness to stochastic processing times.

4.1. Linear production line: state of processing units


In this section, the scheduling object is a linear production line with three processing units, m = 3. Without loss of generality, the processing time of each unit, tprocess(i), follows a log-normal distribution with mean value μi and standard deviation σi, 1 ≤ i ≤ m. Because the processing units are independent, the processing times are also independent of each other. In this experiment we set (μ1, μ2, μ3) = (600, 480, 600), while the standard deviations are proportional to the mean values.
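If the processing time itself is required to have mean μ and standard deviation σ, the parameters of the underlying normal distribution can be recovered in closed form; a sketch with NumPy (function name illustrative):

import numpy as np

rng = np.random.default_rng()

def sample_processing_time(mean, std):
    """Draw a log-normal processing time with the given mean and standard deviation."""
    if std == 0:
        return mean                               # degenerate, non-stochastic case
    sigma2 = np.log(1.0 + (std / mean) ** 2)      # variance of the underlying normal
    mu = np.log(mean) - 0.5 * sigma2              # mean of the underlying normal
    return rng.lognormal(mean=mu, sigma=np.sqrt(sigma2))

# Linear line of Section 4.1: means (600, 480, 600) with, for example, std = mean / 10.
times = [sample_processing_time(m, m / 10) for m in (600, 480, 600)]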
Following the intelligent scheduling method of Section 3, we first schedule the linear automated production line in the non-stochastic setting. After the deep RL scheduling converges, we replay the last schedule and draw its Gantt diagram in Figure 2. The first row is the transfer unit and rows 2-4 are the processing times of the three processing units; the processing tasks then repeat. The learning curves are shown as the non-stochastic curves in Figure 3. The reward sum and output increase with the RL episodes and converge to a stable performance. The simulation time increases in the early period because the output is increasing; once the output reaches the target, the simulation time decreases because the scheduling policy is being optimised. Similarly, in the first half, 1000 scheduling steps cannot reach the target output, so the number of scheduling steps stays at 1000. Once the target can be reached, the number of scheduling steps keeps decreasing as the policy is optimised and converges to a stable policy.

Figure 2. Gantt diagram of intelligent scheduling based on deep RL on linear production line.

Figure 3. In linear production line scheduling, when the processing time follows log-normal distributions with the same mean values and different variances, the scheduling policy converges in every case. (a) Reward sum, (b) total output, (c) simulation time and (d) scheduling steps.

Then we conduct the experiment under the same conditions with different levels of processing time randomness. There are five experiments with σi = cμi, 1 ≤ i ≤ m, where c = 1/60, 1/30, 1/20, 1/10, 1/5. The resulting reward sum, output, simulation time and scheduling steps are shown in Figure 3.
According to the experiments with different randomness levels, even as the processing time randomness increases, the deep RL scheduling policy converges to efficient behaviour, demonstrating robustness to processing time. Notably, the simulation time of scheduling increases both during learning and in the final performance, because some processing resources are wasted due to the randomness of processing time, as shown in Figure 4.
Performance parameters of the scheduling policies under different randomness levels are extracted from the Gantt diagrams. We define the processing time ratio on the crucial path (PTRCP) as the ratio of the sum of processing times on the crucial path to the retention time of the product. Riedmiller and Riedmiller (1999), Gabel and Riedmiller (2007) and Waschneck et al. (2018a, 2018b) compared against heuristic methods such as first-in-first-out (FIFO) when verifying intelligent scheduling. We therefore take the heuristic methods FIFO, longest processing time (LPT) and shortest processing time (SPT), with no randomness, as the comparison. The performance parameters of the different policies, and of intelligent scheduling based on deep RL under different randomness levels, are given in Table 3.
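Given the intervals recovered from a Gantt chart, PTRCP follows directly from its definition; a small illustrative helper:

def ptrcp(crucial_path_durations, entry_time, exit_time):
    """Processing time ratio on the crucial path:
    sum of processing times on the crucial path / retention time of the product."""
    retention = exit_time - entry_time
    return sum(crucial_path_durations) / retention

# Example: a product processed for 600 + 480 + 600 time units on the crucial path,
# entering the line at t = 0 and leaving at t = 1700.
print(ptrcp([600, 480, 600], entry_time=0, exit_time=1700))   # approximately 0.988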
Comparing the different scheduling policies, it is convincing to conclude that intelligent scheduling based on deep RL behaves close to the traditional heuristic scheduling policies.

Figure 4. As the randomness of processing time increases, the randomness of event occurrence and state transition increases. At time 11,500, three resources are busy and the finishing sequence is M3–M2–M1, whose bottleneck is M1; at time 12,500, three resources are again busy but the finishing sequence is M1–M3–M2, whose bottleneck is M2. The randomness of the subsequent state transitions leads to wasted processing capacity and influences the value estimation during policy learning.

Table 3. Linear production line scheduling performance parameter table.


Scheduling policy / randomness    PTRCP mean value    PTRCP variance    M1 use ratio    M2 use ratio    M3 use ratio
FIFO 0.9971 3.1011*10−6 0.9727 0.7782 0.9727
LPT 0.9971 3.1011*10−6 0.9727 0.7782 0.9727
SPT 0.9971 3.1011*10−6 0.9727 0.7782 0.9727
FIFO σi =0 0.9947 3.1011*10-6 0.9727 0.7782 0.9727
FIFO σi = μi /60 0.9394 0.2726 0.9402 0.7481 0.9414
FIFO σi = μi /30 0.9310 0.6861 0.9074 0.7002 0.8681
FIFO σi = μi /20 0.8871 1.4966 0.8300 0.6471 0.8399
FIFO σi = μi /10 0.7899 4.1884 0.6639 0.5111 0.7279
RL σi =0 0.9971 3.1011*10−6 0.9727 0.7782 0.9727
RL σi = μi /60 0.9720 0.0457 0.9706 0.7784 0.9706
RL σi = μi /30 0.9574 0.0730 0.9661 0.7779 0.9699
RL σi = μi /20 0.9541 0.0912 0.9658 0.7763 0.9709
RL σi = μi /10 0.9484 0.2467 0.9463 0.7464 0.9368
RL σi = μi /5 0.9119 0.7448 0.9037 0.6882 0.9083

Further, deep RL performs more efficiently (higher PTRCPs and use ratios) and more stably (lower variances) in the random scenarios. Comparing the different randomness levels, the PTRCP mean values decrease and the variances increase as randomness increases. This is mainly because resources are wasted and time is spent waiting senselessly, so processing efficiency falls (the PTRCP value decreases). And since the overall processing time of a product is the sum of independent log-normally distributed processing times, the overall processing time is approximately log-normal by the additive property of the log-normal distribution; thus the volatility of the processing parameters increases (the PTRCP variance increases).
Moreover, analysing the mechanism of deep RL, production efficiency decreases because the randomness of subsequent event occurrence and state transition grows as the processing time randomness grows. For example, in the non-stochastic condition the agent may learn a sequence such as s1 →(a1) s2 →(wait) s3 . . . , where the value of state s3 directly influences the value estimate of choosing s1 →(a1) s2. As randomness increases, the transition s1 →(a1) s2 itself does not change, but the state following s2 may sometimes be s3 and sometimes s4, as shown in the Gantt diagram of Figure 4, and different successor states affect the value the agent learns for the action s1 →(a1) s2. So, because of the randomness of processing times, the randomness of event occurrence and of subsequent state transitions also increases, which makes the converged policy suboptimal for future states and therefore wastes resources. In all, the main reason the policy is affected is that the randomness of event occurrence and state transition increases with the randomness of processing time.

Figure 5. Overall layout of parallel production line containing seven processes.

Figure 6. Gantt diagram of intelligent scheduling based on deep RL on parallel production line. Apparently, bottleneck of this parallel
line is M4.

4.2. Parallel production line: state of processes


In this section, the scheduling object is a parallel production line with seven processes, p = 7. The overall layout of this production line is shown in Figure 5. The successor of process 1 can be either process 2 or process 3, and process 6 assembles two semi-finished products, which is typical of an assembly line. We set the processing time of each process tprocess(j) to follow a log-normal distribution with mean value μj and standard deviation σj, 1 ≤ j ≤ p, with (μ1, μ2, μ3, μ4, μ5, μ6, μ7) = (100, 200, 250, 300, 150, 100, 150) and the standard deviations proportional to the mean values.
Figure 7. In parallel production line scheduling, when the processing time follows log-normal distributions with different variances, the scheduling policy converges in every case. (a) Reward sum, (b) total output, (c) simulation time and (d) scheduling steps.
Table 4. Parallel production line scheduling performance parameter table.
Scheduling policy / randomness    PTRCP mean value    PTRCP variance    M1 use ratio    M2 use ratio    M3 use ratio    M4 use ratio    M5 use ratio    M6 use ratio    M7 use ratio
FIFO 0.7467 0.0256 0.5939 0.5939 0.7424 0.8908 0.4454 0.2969 0.4454
LPT 0.7467 0.0256 0.5939 0.5939 0.7424 0.8908 0.4454 0.2969 0.4454
SPT 0.7467 0.0256 0.5939 0.5939 0.7424 0.8908 0.4454 0.2969 0.4454
FIFO σi =0 0.7467 0.0256 0.5939 0.5939 0.7424 0.8908 0.4454 0.2969 0.4454

FIFO σi = μi /50 0.7444 0.0372 0.5949 0.5822 0.7409 0.8931 0.4441 0.2912 0.4382
FIFO σi = μi /20 0.7399 0.1130 0.5821 0.5617 0.7326 0.8737 0.4191 0.2711 0.4280
FIFO σi = μi /10 0.6950 0.5962 0.5113 0.4406 0.4222 0.8512 0.3319 0.3319 0.2103
FIFO σi = μi /5 0.5378 1.9683 0.2764 0.3074 0.3109 0.6453 0.2331 0.0834 0.1371
FIFO σi = μi /2 0.4116 2.1712 0.1403 0.0439 0.0931 0.8409 0.0221 0.0283 0.0664
RL σi =0 0.5943 0.0126 0.5574 0.5574 0.6968 0.8361 0.4181 0.2787 0.4181
RL σi = μi /50 0.5730 0.0341 0.5284 0.5399 0.6999 0.8135 0.4088 0.2737 0.4021
RL σi = μi /20 0.5791 0.0903 0.4780 0.5043 0.5562 0.8430 0.3555 0.2880 0.3857
RL σi = μi /10 0.4825 0.3029 0.3795 0.4290 0.4535 0.6664 0.3120 0.2204 0.2654
RL σi = μi /5 0.3846 0.3778 0.2235 0.5425 0.4909 0.3748 0.1467 0.1327 0.2632
RL σi = μi /2 0.2280 1.0701 0.2744 0.2887 0.4308 0.3447 0.0410 0.1019 0.0971

Similarly, we first schedule the parallel automated production line in the non-stochastic setting and replay the last schedule to draw the Gantt diagram in Figure 6. The learning curves are shown as the non-stochastic curves in Figure 7. The reward sum and output increase with the RL episodes and converge to a stable performance, and the scheduling steps decrease in the latter half and converge to an optimal performance. The stronger oscillation of the learning curves indicates that a more complex task requires more training resources.
Then we experiment under the same conditions with different levels of processing time randomness. There are five experiments with σj = cμj, 1 ≤ j ≤ p, where c = 1/50, 1/20, 1/10, 1/5, 1/2. The resulting reward sum, output, simulation time and scheduling steps are shown in Figure 7.
Similarly, the performance parameters of the scheduling policies under different randomness levels are extracted, and we take the heuristic methods FIFO, LPT and SPT as the comparison, as given in Table 4 (the crucial path is M1–M2–M4–M6–M7).
Comparing the different scheduling policies, intelligent scheduling based on deep RL behaves close to the traditional heuristic methods. The bottleneck of this parallel line is M4, and RL reaches a comparably high use ratio on it.

Figure 8. Overall layout of re-entrant production line containing four processes.

Figure 9. Gantt diagram of intelligent scheduling based on deep RL on re-entrant production line, PM1 and PM4 are re-entrant
processes.

strategy works efficiently enough. Also PTRCP mean value is a bit lower because intelligent scheduling ignores less scarce
resources and process, such as M7 (see Gantt of Figure 6), which may lengthen remain time and therefore the performance
descends. In contrast of deep RL and FIFO in random scenarios, deep RL shows much more stability and robustness (lower
variances) in different randomness, although a bit worse but comparable on efficiency.

Figure 10. In re-entrant production line scheduling, when the processing time follows log-normal distributions with different variances, the scheduling policy converges in every case. (a) Reward sum, (b) total output, (c) simulation time and (d) scheduling steps.

Comparing the different randomness levels, processing efficiency again falls (the PTRCP value decreases) and the volatility of the processing parameters grows (the PTRCP variance increases) because of the increasing randomness of event occurrence and state transition. However, when the randomness is extremely high, at σj = μj/5 and σj = μj/2, the RL agent learns a totally different and inefficient strategy, although it still manages to process products sequentially.

4.3. Re-entrant production line: state of processes


In this section, the scheduling object is a re-entrant production line with four processes, p = 4. The overall layout of the production line is shown in Figure 8. Both process 1 and process 4 require PM(1), which is common in cluster production lines. We set the processing time of each process tprocess(j) to follow a log-normal distribution with mean value μj and standard deviation σj, 1 ≤ j ≤ p, with (μ1, μ2, μ3, μ4) = (100, 200, 250, 150).
We replay the last schedule and draw the Gantt diagram in Figure 9. The learning curves are shown as the non-stochastic curves in Figure 10. The reward sum and output keep increasing and the scheduling steps decrease in the latter half as the RL episodes proceed, converging to a stable performance. The learning curves oscillate severely, demonstrating that re-entrant scheduling is quite challenging.
Then we experiment under the same conditions with different levels of processing time randomness. There are five experiments with σj = cμj, 1 ≤ j ≤ p, where c = 1/50, 1/20, 1/10, 1/5, 1/2. The resulting reward sum, output, simulation time and scheduling steps are shown in Figure 10.
Similarly, the performance parameters of the scheduling policies under different randomness levels are extracted, and we take the heuristic methods FIFO, LPT, SPT and a genetic algorithm (GA) as the comparison; however, the heuristic policies deadlock under constant processing times and fail to keep scheduling, as given in Table 5.
Comparing the different scheduling policies, intelligent scheduling based on deep RL manages to learn a stable scheduling policy, while the traditional heuristic policies deadlock in production. Deadlock means the current scheduling action is blocked by other products: for example, the agent chooses to schedule P4 while PM(1) is occupied by P1 in Figure 8. Since there is at least one cycle (P2–P3–P4) in the re-entrant production line, the scheduling agent must design its policy flexibly and tactically to avoid deadlock. For the heuristics, we constrain the WIP (work in process) to avoid deadlock. This constraint on WIP leads to a policy with low efficiency (low use ratios), whereas deep RL shows strength in balancing efficiency and stability. GA manages to produce products sequentially; however, the use ratio of each processing machine is also quite low. This shows deep RL has competitive strength in efficiently scheduling the re-entrant task.

Table 5. Re-entrant production line scheduling performance parameter table.


Scheduling policy / randomness    PTRCP mean value    PTRCP variance    M1 use ratio    M2 use ratio    M3 use ratio
FIFO deadlock
LPT deadlock
SPT deadlock
FIFO,WIP = 1 0.9943 5.534e-30 0.3536 0.2829 0.3536
FIFO,WIP = 1 0.9943 5.534e-30 0.3536 0.2829 0.3536
FIFO,WIP = 1 0.9943 5.534e-30 0.3536 0.2829 0.3536
GA 0.9943 9.8607 0.3536 0.2829 0.3536
FIFO,WIP = 1 σi =0 0.9943 5.534e-30 0.3536 0.2829 0.3536
FIFO,WIP = 1 σi = μi /50 0.9944 1.241e-6 0.3540 0.2791 0.3572
FIFO,WIP = 1 σi = μi /20 0.9946 1.074e-5 0.3404 0.2797 0.3706
FIFO,WIP = 1 σi = μi /10 0.9949 2.798e-5 0.3330 0.2907 0.3679
FIFO,WIP = 1 σi = μi /5 0.9949 0.000117 0.4186 0.3108 0.2634
FIFO,WIP = 1 σi = μi /2 0.9953 0.000625 0.5793 0.0379 0.3820
RL σi =0 0.9658 0.0740 0.6461 0.5169 0.6461
RL σi = μi /50 0.9880 0.0047 0.6503 0.5372 0.6693
RL σi = μi /20 0.9794 0.0199 0.6142 0.5386 0.6765
RL σi = μi /10 0.9329 0.3284 0.5131 0.3991 0.6201
RL σi = μi /5 0.7713 1.2796 0.4499 0.4352 0.4571
RL σi = μi /2 0.7320 2.8024 0.4855 0.0627 0.5106

4.4. Experiment environment and analysis


All experiments run under Ubuntu 18.04.2. The computational hardware is an Intel(R) Xeon(R) CPU E5-2620 [email protected] GHz and a GeForce RTX 2080 GPU. For software, we run our code on Python 3.6.8, and the deep learning framework is TensorFlow 1.9.0. Under this hardware and software, the parameters and runtimes of the experiments are listed in Table 6.
Since the reward is crucial to the strategy learned by RL, we conduct a contrast experiment for parallel scheduling under two different sets of rewards. We denote the two experiments Task I and Task II (the rewards are shown in Table A1 in the Appendix), both in the non-stochastic condition. The scheduling Gantt charts of the two tasks are drawn in Figure 11. In Task I the rewards favour the earlier processes, so the agent manages to keep the bottleneck M4 scheduled, which is sufficiently efficient. In Task II, by contrast, the rewards favour the later processes.

Table 6. Experiment parameters and runtimes.


Experiment parameters Layer width Layer depth Episode Step Runtime (s)
Linear experiment 100 4 12,000 1000 88,376
Parallel experiment 100 5 13,000 600 76,910
Re-entrant experiment 100 5 12,000 200 24,587

Figure 11. Gantt charts of Task I and Task II in parallel production scheduling task. (a) Task I and (b) Task II.

As a result, the agent in Task II decides to produce several products in bursts, once in a while. It effectively evaluates the combinations of scheduling actions and finds the policy that gains the most total reward, yet fails on efficiency. This contrast experiment shows how the set of rewards can affect RL on a scheduling task.
Further, it is apparent that as the task becomes more complicated, the oscillation of the learning curves increases. This is because the agent explores different experiences at different learning stages. In the re-entrant scenario the state is (P4, P3, P2, P1); if we denote idle by 0, busy by 1 and finished by 2, the state can be coded as a ternary integer, and the frequency of the state integers is shown in Figure 12. We find that the frequency of explored states varies across learning episodes. In our view, there are two possible reasons for the oscillation of the learning curves. For one, a complicated task results in complicated branches of state-action combinations, and an excellent policy requires enough time and attempts. For another, there may be locally optimal policies that are best for a certain period but are soon replaced by other policies, which can lead to severe oscillation during learning.
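The ternary coding mentioned above can be written compactly; a sketch assuming the (P4, P3, P2, P1) ordering and the 0/1/2 codes just described:

STATUS_CODE = {"idle": 0, "busy": 1, "finished": 2}

def encode_state(p4, p3, p2, p1):
    """Pack the four process statuses into one base-3 integer (0..80)."""
    code = 0
    for status in (p4, p3, p2, p1):
        code = code * 3 + STATUS_CODE[status]
    return code

print(encode_state("finished", "idle", "busy", "idle"))   # 2*27 + 0*9 + 1*3 + 0 = 57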
Also, there is a dip in the learning curves of the re-entrant experiment (Figure 10). To figure out how the dip arises and how the agent later overcomes it, we examine what happens in episodes 6000-8000 in Figure 12, where the dip occurs. The heat map shows that the frequency of states changes dramatically in this period. This change means the agent gains insight into a distinct course of action and rapidly abandons its earlier policy. In other words, we interpret the dip as the policy being led into a dead end, after which the agent soon finds another policy course and recovers.

Figure 12. Frequency heat map of states in different learning episodes.



Last but not least, we would like to discuss the convergence of RL scheduling, especially in Figures 7 and 10. As shown in Figure 7(a) and 7(d), the total reward and the scheduling steps for the lower randomness levels (non-stochastic and σ = μ/60) have converged to a horizontal level by episode 12,000, which demonstrates that the RL policy has reached a stable and mature stage. Under higher randomness, the curves keep oscillating because each realised processing time is random. Overall, scheduling performance depends on both the randomness of processing times and the convergence of the policy network. In the later learning period the network is relatively converged (although it keeps tuning according to the sampled experience), but the processing times are still randomly generated and unknown to the agent. Therefore, the performance appears to oscillate continuously; in a stochastic-time scheduling task we should not expect a perfectly horizontal, stable curve.

5. Conclusion and future work


This paper proposes a deep RL-based online scheduling method for discrete automated production lines. Deep RL is brought into discrete automated production line scheduling, for which it is difficult to extract features manually and prohibitive to build structured data sets. A DES environment is built to provide an efficient environment for the RL model, and the learned online intelligent scheduling policy reaches competitive performance.
This paper proposes a state modelling method for discrete automated production line processing. This method reduces the complexity of the state, makes the simulation environment more precise, and overcomes the weakness of ignoring transferring in the previous literature.
The intelligent scheduling based on deep RL combines the DES method and the iterative learning of RL. The agent is given sufficient chances to keep choosing transferring actions by adding a null event after each transfer, and the environment combines DES's event-driven time advance with RL's step-by-step learning. Real transferring time, although far smaller than processing time, is fully considered in the simulation, and the algorithm resolves the conflict between forcing the simulation time forward and responding to the next event.
The intelligent scheduling based on deep RL adaptively learns efficient policies for different production lines and shows robustness to processing time randomness. Assuming processing times follow log-normal distributions, the agent learns to schedule under different randomness levels obtained by changing the variances. Experiments on scheduling linear, parallel and re-entrant discrete automated production lines verify that the learned scheduling policies have performance comparable to heuristic scheduling. Deep RL strikes a balance between efficiency and stability in the linear and re-entrant scenarios, although it still performs somewhat worse than the heuristics in the parallel scenario. The contrast between deep RL and FIFO in the stochastic scenarios shows that deep RL has sufficient robustness to random processing times.
Meanwhile, the oscillation after convergence under stochastic conditions remains a problem. Scheduling performance depends on the randomness of processing times and the convergence of the policy network. In the later learning period the network is relatively converged, but the processing times are still randomly generated and unknown to the agent, so the performance appears to oscillate continuously.
Last but not least, we would like to address reproducibility on other tasks. We believe the capability of RL scheduling is reproducible in a statistical sense. In the early period the agent is encouraged to explore randomly, which accumulates sufficiently varied experience, and the experience is sampled randomly for learning, so the learning procedure is not fixed. Even in the random scenarios, each processing time is generated randomly and each learning experience is unique. However, since we have shown that the policy can be learned to convergence, we believe the RL scheduling method is capable of dealing with similarly complicated scheduling tasks, in the sense of statistics.
In conclusion, intelligent scheduling based on deep RL is a scheduling method that uses deep RL to schedule single-product discrete automated production lines. Deep RL achieves scheduling performance competitive with heuristic scheduling methods on linear, parallel and re-entrant discrete automated production lines and shows good robustness to processing time randomness. Meanwhile, the oscillation after convergence constrains the performance of the deep RL scheduling policy, which requires further study. In addition, the intelligent scheduling in this paper mainly focuses on inhomogeneous processing units of which there is one each; we would like to study how scheduling performs (with state model c) when there are multiple homogeneous processing units.

Disclosure statement
No potential conflict of interest was reported by the authors.

References
Bellman, R. E., and S. E. Dreyfus. 1962. Applied Dynamic Programming. Princeton, NJ: Princeton University Press.

Gabel, T., and M. Riedmiller. 2007. “Scaling Adaptive Agent-Based Reactive Job-Shop Scheduling to Large-Scale Problems.” 2007 IEEE
Symposium on Computational Intelligence in Scheduling, Honolulu, HI, USA, 259–266.
Kara, A., and I. Dogan. 2018. “Reinforcement Learning Approaches for Specifying Ordering Policies of Perishable Inventory Systems.”
Expert Systems With Applications 91: 150–158.
Kuhnle, A., N. Röhrig, and G. Lanza. 2019. “Autonomous Order Dispatching in the Semiconductor Industry Using Reinforcement
Learning.” Procedia CIRP 79: 391–396.
Li, Lin-ying, Rui Lu, and Jie Zang. 2016. “Scheduling Model of Cluster Tools for Concurrent Processing of Multiple Wafer Types.”
Mathematics in Practice and Theory (16): 152–161.
Li, X., J. Wang, and R. Sawhney. 2012. “Reinforcement Learning for Joint Pricing, Lead-Time and Scheduling Decisions in Make-to-
Order Systems.” European Journal of Operational Research 221 (1): 99–109.
Lin, Zhongwei, and Yiping Yao. 2015. “Load Balancing for Parallel Discrete Event Simulation of Stochastic Reaction and Diffusion.”
2015 IEEE International Conference on Smart City/SocialCom/SustainCom (SmartCity), Chengdu, China, 609–614.
Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, et al. 2015. “Human-
level Control Through Deep Reinforcement Learning.” Nature 518 (7540): 529–533.
Palombarini, J., and E. Martínez. 2012. “SmartGantt – An Intelligent System for Real Time Rescheduling Based on Relational
Reinforcement Learning.” Expert Systems With Applications 39 (11): 10251–10268.
Paternina-Arboleda, C. D., and T. K. Das. 2005. “A Multi-Agent Reinforcement Learning Approach to Obtaining Dynamic Control
Policies for Stochastic lot Scheduling Problem.” Simulation Modelling Practice & Theory 13 (5): 389–406.
Riedmiller, S. C., and M. A. Riedmiller. 1999. “A Neural Reinforcement Learning Approach to Learn Local Dispatching Policies in
Production Scheduling.” Sixteenth International Joint Conference on Artificial Intelligence.
Shahrabi, J., M. A. Adibi, and M. Mahootchi. 2017. “A Reinforcement Learning Approach to Parameter Estimation in Dynamic Job Shop
Scheduling.” Computers & Industrial Engineering 110: 75–82.
Shin, M., K. Ryu, and M. Jung. 2012. “Reinforcement Learning Approach to Goal-Regulation in a Self-Evolutionary Manufacturing
System.” Expert Systems With Applications 39 (10): 8736–8743.
Shiue, Y. R., K. C. Lee, and C. T. Su. 2018. “Real-time Scheduling for a Smart Factory Using a Reinforcement Learning Approach.”
Computers & Industrial Engineering 125: 604–614.
Stricker, N., A. Kuhnle, R. Sturm, and S. Friess. 2018. “Reinforcement Learning for Adaptive Order Dispatching in the Semiconductor
Industry.” CIRP Annals – Manufacturing Technology 67 (1): 511–514.
Szepesvári, Csaba. 2010. “Algorithms for Reinforcement Learning.” Synthesis Digital Library of Engineering and Computer Science.
San Rafael, CA: Morgan & Claypool.
Waschneck, B., A. Reichstaller, L. Belzner, T. Altenmuller, T. Bauernhansl, T. Knapp, and A. Kyek. 2018a. “Deep Reinforcement Learn-
ing for Semiconductor Production Scheduling.” 2018 29th Annual SEMI Advanced Semiconductor Manufacturing Conference
(ASMC), 301–306.
Waschneck, B., A. Reichstaller, L. Belzner, T. Altenmüller, T. Bauernhansl, T. Knapp, and A. Kyek. 2018b. “Optimization of Global
Production Scheduling with Deep Reinforcement Learning.” Procedia CIRP 72: 1264–1269.
Zhang, W., and T. G. Dietterich. 1995. “A Reinforcement Learning Approach to Job-shop Scheduling.” International Joint Conference on
Artificial Intelligence. Montréal: Morgan Kaufmann Publishers.
Zhang, W., and T. G. Dietterich. 1996. “High-Performance Job-Shop Scheduling with a Time-Delay TD(λ) Network.” Advances in Neural
Information Processing Systems 1996: 1024–1030.
Zhang, Z., L. Zheng, N. Li, W. Wang, S. Zhong, and K. Hu. 2012. “Minimizing Mean Weighted Tardiness in Unrelated Parallel Machine
Scheduling with Reinforcement Learning.” Computers and Operations Research 39 (7): 1315–1324.
Zweben, M., E. Davis, B. Daun, and M. J. Deale. 1993. “Scheduling and Rescheduling with Iterative Repair.” IEEE Transactions on
Systems, Man and Cybernetics 23 (6): 1588–1596.

Appendix
Table A1. Parallel contrast experiment rewards.

Task Action
P1 P2 P3 P4 P5 P6 P7 P8 WAIT
Task I 50 40 40 30 30 20 10 10 − 20
Task II 10 20 20 30 30 40 40 5 − 20

