Intelligent scheduling of discrete automated production line via deep reinforcement learning
Daming Shia∗ , Wenhui Fana , Yingying Xiaob , Tingyu Linb and Chi Xingb
a Department of Automation, Tsinghua University, Beijing, People’s Republic of China; b State Key Laboratory of Intelligent
Manufacturing System Technology, Beijing Institute of Electronic System Engineering, Beijing, People’s Republic of China
(Received 6 April 2019; accepted 9 January 2020)
Reinforcement learning (RL) is being used for scheduling to improve the adaptability and flexibility of automated production lines. However, existing methods assume that processing times are deterministic and known, and they ignore the production line layout and the transfer unit, such as a robot. This paper introduces deep RL to schedule an automated production line, avoiding manually extracted features and overcoming the lack of structured data sets. Firstly, we present a state modelling method for discrete automated production lines that is suitable for linear, parallel and re-entrant production lines with multiple processing units. Secondly, we propose an intelligent scheduling algorithm based on deep RL for scheduling automated production lines. The algorithm establishes a discrete-event simulation environment for deep RL, resolving the conflict between advancing the transfer time and the most recent event time. Finally, we apply the intelligent scheduling algorithm to scheduling linear, parallel and re-entrant automated production lines. The experiments show that our scheduling strategy achieves performance competitive with heuristic scheduling methods and maintains stable convergence and robustness under processing-time randomness.
Keywords: intelligent scheduling; deep reinforcement learning; discrete-event simulation
1. Introduction
Discrete automated production lines are designed to produce complex products: transfer units grasp and transfer products, while processing units process them. The automated transfer of raw-material input and semi-finished output raises the issue of allocating tasks to the processing units and transfer units simultaneously in the schedule. Most existing research assumes that processing times are deterministic and proves that scheduling with a single transfer unit is an NP-hard problem. Typical methods include mathematical programming, Gantt charts, Petri nets and branch-and-bound (Li, Lu, and Zang 2016).
Discrete automated production lines consist of multiple processing, assembling and detecting machines and a multi-degree-of-freedom robot that transfers products. Each piece of equipment performs complex assembly or position-detection operations, so its actual processing time is stochastic. Therefore, the scheduling system is unaware of the exact processing time of each process, and offline global scheduling is not feasible.
In this paper, we present an online RL-based scheduling method and verify that the resulting strategy is robust to stochastic processing times. The scheduling method is based on deep reinforcement learning (RL) and is efficient and stable for scheduling discrete automated production lines.
Empirically, both supervised and unsupervised learning in machine learning (ML) need data sets for training, but it is difficult and prohibitively expensive to label a sufficient scheduling data set manually. RL, in contrast, enables a control system to optimise its policy through exploration, with no initial data or policy given. Deep RL has typically been applied to tasks that evolve in fixed time steps, such as Atari games and Go, whereas scheduling a discrete automated production line is a typical discrete-event task. Thus, in order to enable the scheduling system (the agent in RL) to learn an efficient policy, we build an accurate and effective simulation environment for RL.
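To make this interaction concrete, the sketch below shows one possible shape of such a simulation environment with a reset/step interface; the class name, state encoding and numeric values are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a discrete-event production-line environment exposed to an
# RL agent. Names and constants are illustrative assumptions only.
import random


class ProductionLineEnv:
    """Wraps a discrete-event simulation of a production line as an RL environment."""

    def __init__(self, num_units=3, max_steps=200):
        self.num_units = num_units          # number of processing units
        self.max_steps = max_steps          # scheduling steps per episode
        self.reset()

    def reset(self):
        """Start a new episode: all units idle, simulation clock at zero."""
        self.state = [0] * self.num_units   # 0 = idle, 1 = busy, 2 = finished
        self.sim_time = 0.0
        self.steps = 0
        return tuple(self.state)

    def step(self, action):
        """Apply a transfer-unit action, advance the simulation, return (s', r, done)."""
        self.steps += 1
        reward = -1.0                        # small time penalty per step (assumed)
        if action < self.num_units and self.state[action] == 0:
            self.state[action] = 1           # dispatch a product to an idle unit
            self.sim_time += random.lognormvariate(0.0, 0.1)  # stochastic duration
        done = self.steps >= self.max_steps
        return tuple(self.state), reward, done
```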
The remainder of the paper is organised as follows. Section 2 introduces related works. Section 3 presents the deep
RL-based intelligent scheduling method. Section 4 conducts an experiment and makes analysis of scheduling linear, parallel
and re-entrant discrete automated production lines based on deep RL. The fifth section concludes and shows our future
work.
2. Related works
Many researchers have resorted to RL for scheduling policy research. Zhang and Dietterich (1995, 1996) first applied RL to job-shop scheduling, using temporal-difference TD(λ) to handle NASA space shuttle payload processing and outperforming the iterative repair method of Zweben et al. (1993). Riedmiller and Riedmiller (1999) used RL to learn local dispatching policies for job shops on single-resource and three-resource lines, outperforming heuristic scheduling. Paternina-Arboleda and Das (2005) studied a single server with multiple products to obtain a dynamic control policy with RL, yet the state-action space grows to 1.8 × 10^7. Gabel and Riedmiller (2007) proposed a partially observable job-shop scheduling policy with RL. Zhang et al. (2012) modelled the n-task m-machine scheduling problem as a reinforcement learning problem that chooses among four heuristic methods, obtaining a good online scheduling system. Palombarini and Martínez (2012) proposed the automated generation and update of rescheduling knowledge through learning. Stricker et al. (2018) used Q-learning to obtain an online scheduling system, considering the influence of production complexity and processing detail. Waschneck et al. (2018a, 2018b) used deep RL for multi-agent scheduling. Shiue, Lee, and Su (2018) used Q-learning to design multi-rule online scheduling based on multiple dispatching rules (MDR).
Moreover, researchers have brought RL to bear on the randomness in production and scheduling. Li, Wang, and Sawhney (2012) designed a Q-learning method to decide whether to accept or refuse tasks over an infinite planning horizon with stochastic demand. Shin, Ryu, and Jung (2012) applied RL to autonomous goal formation. Lin and Yao (2015) used a multi-state Q-learning approach for a load-balancing problem. Shahrabi, Adibi, and Mahootchi (2017) considered the randomness of task arrivals and machine breakdowns with Q-factor RL. Kara and Dogan (2018) dealt with inventory management under random demand and deterministic lead time with RL.
Generally, there are two weaknesses in current works.
Firstly, current scheduling methods for discrete automated production lines focus on the processing units and products rather than on the time consumed by transfers. Stricker et al. (2018) and Kuhnle, Röhrig, and Lanza (2019) consider specific layouts and study transport resources as the main point, but the scope of most intelligent scheduling algorithms is still restricted to processing units. Therefore, accounting for transfer motions in a discrete automated production line is significant.
Secondly, current adaptive or intelligent scheduling mainly addresses machine breakdowns and task arrivals rather than the randomness of processing times. Although Waschneck et al. (2018a, 2018b) consider stochastic processing times, studies of robustness to stochastic processing times remain scarce. Therefore, it is necessary to investigate how processing-time randomness influences the scheduling policy.
In contrast, on the one hand, our proposed RL-based scheduling focuses on scheduling the transfer units rather than the processing units; on the other hand, the algorithm combines discrete-event simulation (DES) and deep RL, and we verify its robustness to processing-time randomness.
DeepMind observed that the applicability of classical RL is limited to domains in which features are easily extracted or states are low-dimensional and fully observable. Deep RL instead trains neural networks so that the agent can perceive high-dimensional inputs, achieving end-to-end RL (Mnih et al. 2015). In this paper, we focus on scheduling discrete automated production lines, a task for which feature functions are difficult to handcraft. Moreover, the state dimension can be prohibitively high when processing resources and product demands vary. Therefore, we consider deep RL a proper and capable machine learning method for scheduling discrete automated production lines.
The following sections introduce the state modelling for processing units, the action modelling for transfer units, the reward modelling for the production line environment and the iteration of scheduling based on deep RL.
Completing a product is the goal. When the agent completes a product, we give a large reward. This bonus indicates that we prefer the agent to complete more products rather than waste time accumulating small rewards.
In every episode of RL, the number of scheduling steps is finite. The agent accumulates reward over these limited steps, which reflects the behaviour of its policy. Generally, the reward setting determines both the policy RL will learn and its convergence.
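As a minimal sketch of this reward design, the function below rewards product completion far more strongly than routine steps; the numeric constants and the infeasible-action penalty are assumptions made for illustration, not the values used in the paper.

```python
# Illustrative reward shaping: a large completion bonus dominates the small
# per-step rewards, so the agent prefers finishing products over accumulating
# minor rewards. All constants are assumptions, not the paper's values.
COMPLETION_BONUS = 100.0   # large encouragement when a product is completed
STEP_PENALTY = -1.0        # small cost for every scheduling step
INFEASIBLE_PENALTY = -5.0  # discourage choosing infeasible actions


def reward(product_completed: bool, action_feasible: bool) -> float:
    r = STEP_PENALTY
    if not action_feasible:
        r += INFEASIBLE_PENALTY
    if product_completed:
        r += COMPLETION_BONUS
    return r
```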
The Q-value is updated from the sampled experience according to
\[
Q(s_k, a_k) \leftarrow r(s_k, a_k) + \gamma \max_{a'} Q^{-}(s_{k+1}, a'; h_k),
\]
where $Q^{-}(s, a; h_k)$ is the value function during learning and $h_k$ denotes the Q-network coefficients at the $k$th iteration. The equation encodes the experience of choosing action $a_k$ in state $s_k$ with reward $r(s, a)$. Deep RL combines the value function predicted by the current neural network with the reward from the environment to obtain a new state-value function until convergence (Bellman and Dreyfus 1962). The network coefficients are trained by minimising the loss
\[
L_i(h_i) = \mathbb{E}\Big[\big(r + \gamma \max_{a'} Q^{-}(s', a'; h_i^{-}) - Q(s, a; h_i)\big)^2\Big],
\]
where $\gamma \in (0, 1]$ is the discount factor: the agent is farsighted if $\gamma$ is close to one and shortsighted if it is close to zero. The discount factor passes future rewards and values back to the current decision, which leads the agent to consider the overall horizon. $h_i$ is the coefficient vector at iteration $i$, and $h_i^{-}$ is the coefficient vector used when the agent predicts the target value, which is updated periodically during training of the Q-network. The loss function of the ANN converges by minimising the squared difference between the predicted target $r + \gamma \max_{a'} Q^{-}(s', a'; h_i^{-})$ and the current value $Q(s, a; h_i)$ (Mnih et al. 2015; Szepesvári 2010).
Finally, the experience of step $k$, $(a_k, s_k, r_k, s_{k+1}, \text{done})$, is stored in the replay memory for learning.
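For concreteness, the sketch below shows how stored experiences can be replayed to form the temporal-difference target $r + \gamma \max_{a'} Q^{-}(s', a')$; the buffer size, batch size and placeholder target network are assumptions made for illustration.

```python
# Sketch of experience replay and the TD target used in deep Q-learning.
# q_target below stands for the periodically frozen network Q^-(.; h_i^-);
# all sizes and the placeholder function are illustrative assumptions.
import random
from collections import deque

import numpy as np

GAMMA = 0.95
replay_memory = deque(maxlen=10_000)


def store(experience):
    """experience = (a_k, s_k, r_k, s_k_plus_1, done)"""
    replay_memory.append(experience)


def td_targets(batch, q_target):
    """Compute r + gamma * max_a' Q^-(s', a') for a minibatch (just r if terminal)."""
    targets = []
    for a, s, r, s_next, done in batch:
        if done:
            targets.append(r)
        else:
            targets.append(r + GAMMA * np.max(q_target(s_next)))
    return np.array(targets)


# usage sketch: sample a minibatch once enough experience has been stored
if len(replay_memory) >= 32:
    minibatch = random.sample(replay_memory, 32)
    y = td_targets(minibatch, q_target=lambda s: np.zeros(4))  # placeholder Q^-
```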
In the step-$k$ algorithm, choosing a feasible action changes the environment state, and a null event gives the agent another chance to choose (even after an infeasible choice), while choosing wait drives the production forward and therefore also changes the state. The simulation algorithm in this paper realises the interaction between DES and the deep RL decisions. Moreover, the processing time is a random value.
In the preceding discussion of simulation driving, if the agent chooses a feasible action, the transfer time forcibly advances the simulation time, which may overshoot the next event, i.e. tsim > tnext. As shown in the algorithm, the next event should be responded to only after the transfer, i.e. tsim = max(tsim, tnext). In production terms, if some process finishes while a transfer task is in progress, the transfer is not interrupted; the system detains the product until the transfer unit is idle, which accounts for the transfer time without timing-sequence errors. We do not include the location of the transfer unit in the state, since the transfer time is far less than the processing time in a processing-dominant production line. Nevertheless, we compute the transfer time in the simulation, which ensures the fidelity of the simulation.
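This time-advance rule can be summarised by the small sketch below, under the assumption that the transfer time and the next event time are known scalars.

```python
# Sketch of the time-advance rule used when a feasible transfer action is chosen.
# The transfer is never interrupted, so the simulation clock first advances by
# the transfer time; the next event is then responded to at
# t_sim = max(t_sim, t_next), i.e. a process that finishes during the transfer
# waits until the transfer unit is idle.
def advance_clock(t_sim: float, transfer_time: float, t_next: float) -> float:
    t_sim += transfer_time            # transferring forces the clock forward
    return max(t_sim, t_next)         # respond to the pending event afterwards


# example: the transfer overshoots the next event, so the event is handled late
print(advance_clock(t_sim=100.0, transfer_time=12.0, t_next=105.0))  # 112.0
```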
In this section, the simulation algorithm for intelligent scheduling based on deep RL combines DES and deep RL and accounts for the transfer time, which makes the intelligent scheduling simulation more authentic and complete.
Figure 2. Gantt diagram of intelligent scheduling based on deep RL on the linear production line.
Figure 3. In linear production line scheduling, when processing times follow log-normal distributions with the same mean and different variances, the scheduling policies converge to the same behaviour. (a) Reward sum, (b) total output, (c) simulation time and (d) scheduling steps.
Then we conduct the experiment under the same conditions with different levels of processing-time randomness. There are five experiments with σi = cμi, 1 ≤ i ≤ m, where c = 1/60, 1/30, 1/20, 1/10, 1/5. The reward sum, output, simulation time and scheduling steps are shown in Figure 3.
According to the experiments with different randomness, even as the processing-time randomness increases, the scheduling policy based on deep RL converges to efficient behaviour, showing robustness to processing-time randomness. Notably, the simulation time of the schedule increases both during the learning period and in the final performance, because some processing resources are wasted due to the randomness of the processing times, as shown in Figure 4.
Performance parameters of the scheduling policies under different randomness are extracted from the Gantt diagrams. We define the processing time ratio on the critical path (PTRCP) as the ratio of the sum of processing times on the critical path to the retention time of the product. Riedmiller and Riedmiller (1999), Gabel and Riedmiller (2007) and Waschneck et al. (2018a, 2018b) compared against heuristic methods such as first-in-first-out (FIFO) to verify intelligent scheduling. Therefore, we take the heuristic methods FIFO, longest processing time (LPT) and shortest processing time (SPT) as comparisons, with no randomness. The performance parameters of the different policies, and of intelligent scheduling based on deep RL under different randomness, are given in Table 3.
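A minimal sketch of computing PTRCP from a schedule record is given below; the record layout (a list of start/end intervals per critical-path operation, plus the product's entry and exit times) is an assumption made for illustration.

```python
# Sketch of the PTRCP metric: the sum of processing times on the critical path
# divided by the product's retention time in the line.
def ptrcp(critical_path_ops, entry_time, exit_time):
    processing_sum = sum(end - start for start, end in critical_path_ops)
    retention_time = exit_time - entry_time
    return processing_sum / retention_time


# usage: three operations on the critical path, product retained for 700 time units
print(ptrcp([(0, 100), (120, 320), (350, 650)], entry_time=0, exit_time=700))  # ~0.857
```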
Comparing the different scheduling policies, it is convincing to conclude that intelligent scheduling based on deep RL performs close to the traditional heuristic scheduling policies. Further, deep RL performs more efficiently (higher PTRCPs and use ratios) and more stably (lower variances) in random scenarios. Comparing different levels of randomness, the PTRCP mean decreases and its variance increases as the randomness grows. This is mainly because resources are wasted and time is spent in pointless waiting, so processing efficiency declines (the PTRCP value decreases). Moreover, since the overall processing time of a product is the sum of processing times that independently follow log-normal distributions, the overall processing time is approximately log-normal, so the volatility of the processing parameters increases (the PTRCP variance increases).
Figure 4. As the randomness of processing times increases, the randomness of event occurrence and state transition increases. At time 11,500, three resources are busy and the finishing sequence is M3–M2–M1, whose bottleneck is M1; at time 12,500, three resources are again busy but the finishing sequence is M1–M3–M2, whose bottleneck is M2. The randomness of the subsequent state transitions wastes processing capacity and influences the value estimation during policy learning.
Moreover, analysing the mechanism of deep RL, production efficiency decreases because the randomness of subsequent event occurrence and state transition increases as the processing-time randomness increases. For example, in the non-stochastic condition the agent may learn a logic such as $s_1 \xrightarrow{a_1} s_2 \xrightarrow{\text{wait}} s_3 \ldots$, where the value of state $s_3$ definitely influences the value estimation of choosing $s_1 \xrightarrow{a_1} s_2$. However, as randomness increases, the transition $s_1 \xrightarrow{a_1} s_2$ does not change, but the successor state of $s_2$ may sometimes be $s_3$ and sometimes $s_4$, as shown in the Gantt diagram of Figure 4, and the different successor states affect the value the agent learns for the action $s_1 \xrightarrow{a_1} s_2$. So, owing to the randomness of processing time, the randomness of event occurrence and successor-state transitions also increases, which makes the converged policy suboptimal for future states and therefore wastes resources. In summary, the main reason the policy is affected is that the randomness of event occurrence and state transition increases with the randomness of processing time.
Figure 6. Gantt diagram of intelligent scheduling based on deep RL on the parallel production line. Apparently, the bottleneck of this parallel line is M4.
Figure 7. In parallel production line scheduling, when processing times follow log-normal distributions with different variances, the scheduling policies converge to the same behaviour. (a) Reward sum, (b) total output, (c) simulation time and (d) scheduling steps.
Table 4. Parallel production line scheduling performance parameters (use-ratio columns give the utilisation of machines M1–M7).

Scheduling policy / randomness   PTRCP mean   PTRCP variance   M1       M2       M3       M4       M5       M6       M7
FIFO                             0.7467       0.0256           0.5939   0.5939   0.7424   0.8908   0.4454   0.2969   0.4454
LPT                              0.7467       0.0256           0.5939   0.5939   0.7424   0.8908   0.4454   0.2969   0.4454
SPT                              0.7467       0.0256           0.5939   0.5939   0.7424   0.8908   0.4454   0.2969   0.4454
FIFO, σi = 0                     0.7467       0.0256           0.5939   0.5939   0.7424   0.8908   0.4454   0.2969   0.4454
FIFO, σi = μi/50                 0.7444       0.0372           0.5949   0.5822   0.7409   0.8931   0.4441   0.2912   0.4382
FIFO, σi = μi/20                 0.7399       0.1130           0.5821   0.5617   0.7326   0.8737   0.4191   0.2711   0.4280
FIFO, σi = μi/10                 0.6950       0.5962           0.5113   0.4406   0.4222   0.8512   0.3319   0.3319   0.2103
FIFO, σi = μi/5                  0.5378       1.9683           0.2764   0.3074   0.3109   0.6453   0.2331   0.0834   0.1371
FIFO, σi = μi/2                  0.4116       2.1712           0.1403   0.0439   0.0931   0.8409   0.0221   0.0283   0.0664
RL, σi = 0                       0.5943       0.0126           0.5574   0.5574   0.6968   0.8361   0.4181   0.2787   0.4181
RL, σi = μi/50                   0.5730       0.0341           0.5284   0.5399   0.6999   0.8135   0.4088   0.2737   0.4021
RL, σi = μi/20                   0.5791       0.0903           0.4780   0.5043   0.5562   0.8430   0.3555   0.2880   0.3857
RL, σi = μi/10                   0.4825       0.3029           0.3795   0.4290   0.4535   0.6664   0.3120   0.2204   0.2654
RL, σi = μi/5                    0.3846       0.3778           0.2235   0.5425   0.4909   0.3748   0.1467   0.1327   0.2632
RL, σi = μi/2                    0.2280       1.0701           0.2744   0.2887   0.4308   0.3447   0.0410   0.1019   0.0971
The processing times follow $\ln t_{\mathrm{process}}(j) \sim N(\mu_j, \sigma_j^2)$, $1 \le j \le p$, with $(\mu_1, \mu_2, \mu_3, \mu_4, \mu_5, \mu_6, \mu_7) = (100, 200, 250, 300, 150, 100, 150)$, while the standard deviations are proportional to the mean values.
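The stochastic processing times can be sampled as in the sketch below. Because the listed μj are on the scale of the processing times, we read them as the desired mean durations with standard deviation cμj and convert them to the parameters of the underlying normal distribution; this reading is our assumption rather than an explicit statement in the text.

```python
# Sketch of sampling stochastic processing times. We interpret mean as the
# desired mean processing time and std = c * mean as its standard deviation
# (an assumed reading of the paper's parameterisation), then convert to the
# parameters of the underlying normal distribution of the log-normal.
import math
import random

MEANS = [100, 200, 250, 300, 150, 100, 150]   # mu_1 ... mu_7 from the paper
C = 1 / 20                                     # one of the randomness levels


def sample_processing_time(mean, c):
    std = c * mean
    var_log = math.log(1.0 + (std / mean) ** 2)   # variance of ln t
    mu_log = math.log(mean) - 0.5 * var_log       # mean of ln t
    return random.lognormvariate(mu_log, math.sqrt(var_log))


times = [sample_processing_time(m, C) for m in MEANS]
```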
Similarly, we first schedule the parallel automated production line in the non-stochastic setting and replay the final schedule to draw the Gantt diagram in Figure 6. The learning curves are shown as the non-stochastic curves in Figure 7. The reward sum and output increase with the RL episodes and converge to a stable performance, and the scheduling steps decrease in the latter half and converge to an optimal level. The stronger oscillation of the learning curves indicates that the more complex task requires more training resources.
Then we experiment under the same conditions with different levels of processing-time randomness. There are five experiments with σj = cμj, 1 ≤ j ≤ p, where c = 1/50, 1/20, 1/10, 1/5, 1/2. The reward sum, output, simulation time and scheduling steps are shown in Figure 7.
Similarly, performance parameters of the scheduling policies under different randomness are extracted, and we take the heuristic methods FIFO, LPT and SPT as comparisons, as given in Table 4 (the critical path is M1–M2–M4–M6–M7).
Comparing the different scheduling policies, intelligent scheduling based on deep RL performs close to the traditional heuristic methods. The bottleneck of this parallel line is M4, and RL reaches a comparably high use ratio on it, which means the scheduling strategy works efficiently enough. The PTRCP mean is slightly lower because intelligent scheduling ignores less scarce resources and processes, such as M7 (see the Gantt chart of Figure 6), which may lengthen the retention time and therefore degrade performance. Contrasting deep RL with FIFO in random scenarios, deep RL shows much more stability and robustness (lower variances) under different randomness, with slightly worse but comparable efficiency.
Figure 9. Gantt diagram of intelligent scheduling based on deep RL on the re-entrant production line; PM1 and PM4 are re-entrant processes.
Figure 10. In re-entrant production line scheduling, when processing times follow log-normal distributions with different variances, the scheduling policies converge to the same behaviour. (a) Reward sum, (b) total output, (c) simulation time and (d) scheduling steps.
Comparing different levels of randomness, processing efficiency still declines (the PTRCP value decreases) and processing-parameter volatility still grows (the PTRCP variance increases) because of the increasing randomness of event occurrence and state transition. However, when the randomness is extremely high, at σi = μi/5 and σi = μi/2, the RL agent learns a totally different and inefficient strategy, although it still manages to process products sequentially.
Figure 11. Gantt charts of Task I and Task II in the parallel production scheduling task. (a) Task I and (b) Task II.
products once in a while. The agent in Task II evaluates all the scheduling combinations and finds that this policy gains the most total reward, yet it fails on efficiency. This contrast experiment shows how a series of rewards can affect RL on a scheduling task.
Further, it is apparent that when the task becomes more complicated, the oscillation of the learning curves increases. This is because the agent explores different experiences at different learning stages. In the re-entrant scenario, the state is (P4, P3, P2, P1); if we denote 0 for idle, 1 for busy and 2 for finished, the state can be encoded as a ternary integer, and the frequency of state integers is shown in Figure 12. We find that the frequency of explored states varies across learning episodes. In our view, there are two possible reasons for the oscillation of the learning curves. First, a complicated task produces complicated branches of state-action combinations, and an excellent policy requires enough time and attempts. Second, there may be locally optimal policies that are optimal for a certain period but are soon replaced by other policies, which may lead to severe oscillation during learning.
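The ternary encoding mentioned above can be written as the short sketch below; the decoder is added only for illustration.

```python
# Sketch of encoding the re-entrant state (P4, P3, P2, P1) as a base-3 integer,
# with 0 = idle, 1 = busy, 2 = finished for each process slot.
def encode_state(state):
    """state is a tuple such as (P4, P3, P2, P1); returns its ternary integer."""
    code = 0
    for digit in state:
        code = code * 3 + digit
    return code


def decode_state(code, length=4):
    digits = []
    for _ in range(length):
        digits.append(code % 3)
        code //= 3
    return tuple(reversed(digits))


assert decode_state(encode_state((2, 1, 0, 1))) == (2, 1, 0, 1)
```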
Also, there is a dip in the re-entrant learning curves (Figure 10). To figure out how the dip is generated and later overcome by the agent, we examine what happens in episodes 6000–8000 in Figure 12, which correspond to the dip. The heat map shows that the frequency of states undergoes a large shift in this period. Objectively, this change means the agent gains insight into a distinct course of action and rapidly abandons the early policy. Namely, we consider the dip to show that the policy is led into a dead end (the dip), and that the agent soon finds another policy course (the quick recovery).
Last but not least, we discuss the convergence of RL scheduling, especially in Figures 7 and 10. As shown in Figure 7(a) and 7(d), the total reward and scheduling steps under lower randomness (non-stochastic and σ = μ/60) have converged to a horizontal level by episode 12,000, which demonstrates that the RL policy has reached a stable and mature stage. Under higher randomness, the curves keep oscillating because the processing time of each run is random. Overall, the scheduling performance depends on both the randomness of the processing times and the convergence of the policy network. In the later period of learning, the network is relatively converged (although it still keeps tuning according to sampled experience), but the processing times are still randomly generated and unknown to the agent. Therefore, the performance appears to oscillate indefinitely. In fact, in a stochastic-time scheduling task, we should not expect a horizontal, stable line.
Disclosure statement
No potential conflict of interest was reported by the authors.
References
Bellman, R. E., and S. E. Dreyfus. 1962. Applied Dynamic Programming. Princeton, NJ: Princeton University Press.
Gabel, T., and M. Riedmiller. 2007. “Scaling Adaptive Agent-Based Reactive Job-Shop Scheduling to Large-Scale Problems.” 2007 IEEE
Symposium on Computational Intelligence in Scheduling, Honolulu, HI, USA, 259–266.
Kara, A., and I. Dogan. 2018. “Reinforcement Learning Approaches for Specifying Ordering Policies of Perishable Inventory Systems.”
Expert Systems With Applications 91: 150–158.
Kuhnle, A., N. Röhrig, and G. Lanza. 2019. “Autonomous Order Dispatching in the Semiconductor Industry Using Reinforcement
Learning.” Procedia CIRP 79: 391–396.
Li, Lin-ying, Rui Lu, and Jie Zang. 2016. “Scheduling Model of Cluster Tools for Concurrent Processing of Multiple Wafer Types.”
Mathematics in Practice and Theory (16): 152–161.
Li, X., J. Wang, and R. Sawhney. 2012. “Reinforcement Learning for Joint Pricing, Lead-Time and Scheduling Decisions in Make-to-
Order Systems.” European Journal of Operational Research 221 (1): 99–109.
Lin, Zhongwei, and Yiping Yao. 2015. “Load Balancing for Parallel Discrete Event Simulation of Stochastic Reaction and Diffusion.”
2015 IEEE International Conference on Smart City/SocialCom/SustainCom (SmartCity), Chengdu, China, 609–614.
Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, et al. 2015. “Human-
level Control Through Deep Reinforcement Learning.” Nature 518 (7540): 529–533.
Palombarini, J., and E. Martínez. 2012. “SmartGantt – An Intelligent System for Real Time Rescheduling Based on Relational
Reinforcement Learning.” Expert Systems With Applications 39 (11): 10251–10268.
Paternina-Arboleda, C. D., and T. K. Das. 2005. “A Multi-Agent Reinforcement Learning Approach to Obtaining Dynamic Control
Policies for Stochastic lot Scheduling Problem.” Simulation Modelling Practice & Theory 13 (5): 389–406.
Riedmiller, S. C., and M. A. Riedmiller. 1999. “A Neural Reinforcement Learning Approach to Learn Local Dispatching Policies in
Production Scheduling.” Sixteenth International Joint Conference on Artificial Intelligence.
Shahrabi, J., M. A. Adibi, and M. Mahootchi. 2017. “A Reinforcement Learning Approach to Parameter Estimation in Dynamic Job Shop
Scheduling.” Computers & Industrial Engineering 110: 75–82.
Shin, M., K. Ryu, and M. Jung. 2012. “Reinforcement Learning Approach to Goal-Regulation in a Self-Evolutionary Manufacturing
System.” Expert Systems With Applications 39 (10): 8736–8743.
Shiue, Y. R., K. C. Lee, and C. T. Su. 2018. “Real-time Scheduling for a Smart Factory Using a Reinforcement Learning Approach.”
Computers & Industrial Engineering 125: 604–614.
Stricker, N., A. Kuhnle, R. Sturm, and S. Friess. 2018. “Reinforcement Learning for Adaptive Order Dispatching in the Semiconductor
Industry.” CIRP Annals – Manufacturing Technology 67 (1): 511–514.
Szepesvári, Csaba. 2010. “Algorithms for Reinforcement Learning.” Synthesis Digital Library of Engineering and Computer Science.
San Rafael, CA: Morgan & Claypool.
Waschneck, B., A. Reichstaller, L. Belzner, T. Altenmuller, T. Bauernhansl, T. Knapp, and A. Kyek. 2018a. “Deep Reinforcement Learn-
ing for Semiconductor Production Scheduling.” 2018 29th Annual SEMI Advanced Semiconductor Manufacturing Conference
(ASMC), 301–306.
Waschneck, B., A. Reichstaller, L. Belzner, T. Altenmüller, T. Bauernhansl, T. Knapp, and A. Kyek. 2018b. “Optimization of Global
Production Scheduling with Deep Reinforcement Learning.” Procedia CIRP 72: 1264–1269.
Zhang, W., and T. G. Dietterich. 1995. “A Reinforcement Learning Approach to Job-shop Scheduling.” International Joint Conference on
Artificial Intelligence. Montréal: Morgan Kaufmann Publishers.
Zhang, W., and T. G. Dietterich. 1996. “High-Performance Job-Shop Scheduling with a Time-Delay TD(λ) Network.” Advances in Neural
Information Processing Systems 1996: 1024–1030.
Zhang, Z., L. Zheng, N. Li, W. Wang, S. Zhong, and K. Hu. 2012. “Minimizing Mean Weighted Tardiness in Unrelated Parallel Machine
Scheduling with Reinforcement Learning.” Computers and Operations Research 39 (7): 1315–1324.
Zweben, M., E. Davis, B. Daun, and M. J. Deale. 1993. “Scheduling and Rescheduling with Iterative Repair.” IEEE Transactions on
Systems, Man and Cybernetics 23 (6): 1588–1596.
Appendix
Table A1. Parallel contrast experiment rewards.
Task       P1   P2   P3   P4   P5   P6   P7   P8   WAIT
Task I     50   40   40   30   30   20   10   10   −20
Task II    10   20   20   30   30   40   40    5   −20