Orthogonal Array Tuning
Xiang Zhang, Xiaocong Chen, Lina Yao, Chang Ge, Manqing Dong
1 Introduction
Deep learning has recently been attracting much attention in both academia
and industry due to its excellent performance in various research areas such
as computer vision, speech recognition, natural language processing, and brain-
computer interfaces [15]. Nevertheless, deep learning faces an important chal-
lenge: the performance of an algorithm depends heavily on the selection of
hyper-parameters. Compared with traditional machine learning algorithms, deep
learning requires hyper-parameter tuning more urgently because deep neural
networks 1) have more hyper-parameters to be tuned and 2) depend more heavily
on the hyper-parameter configuration. [14] reports that deep learning classification
accuracy fluctuates dramatically, from 32.2% to 92.6%, depending on the selection
of hyper-parameters. Therefore, an effective and efficient hyper-parameter tuning
method is necessary.
However, most existing hyper-parameter tuning methods have drawbacks. In
particular, grid search traverses all possible combinations of the different
hyper-parameters, which is a time-consuming and ad-hoc process [2]. Random
search, which is developed from grid search, sets up a grid of hyper-parameter
values and evaluates randomly sampled combinations [1].
2 Related Work
Currently, there are several widely used tuning methods such as grid search
optimization, random search optimization, and Bayesian optimization. Grid search
and random search require all possible values for each parameter, whereas Bayesian
optimization needs only a range for each parameter. [4] proposed an automated
machine learning method built on the efficiency of Bayesian optimization, and [11]
applied multi-task Gaussian processes to enhance the performance of Bayesian
optimization. However, these methods fall short for deep learning architectures,
which have a larger number of hyper-parameters and whose performance relies
heavily on the configuration.
Apart from the aforementioned methods, orthogonal array based hyper-parameter
tuning has already been used in a range of research areas such as mechanical
¹ The link will be available after the paper is accepted.
engineering and electrical engineering. J.A. Ghani et al. [9] applied an
orthogonal array based approach to optimize the cutting parameters in end
milling. S.S. Mahapatra et al. [8] optimized wire electrical discharge machining
(WEDM) process parameters with the orthogonal array method.
Summary. The traditional methods are not well suited to deep learning algorithms,
while the effectiveness of OATM has been demonstrated across many research
topics. We therefore adopt OATM for deep learning hyper-parameter tuning. To
the best of our knowledge, our work is among the first studies in this area.
be statistically independent of each other). The 27 nodes on the surface of the
cube together denote S, while the 9 circled nodes represent the 9 combinations in
O. It is easy to observe in Figure 1 that the combinations (circled nodes) sampled
by OATM are uniformly distributed: each of the 27 axis-parallel grid lines of the
cube contains one circled node, and each of the 9 grid planes contains three
circled nodes.
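The uniform-coverage claim above can be checked directly: a short script over the 9 rows of Table 1 confirms that every axis-parallel line of the 3 × 3 × 3 grid contains exactly one selected combination.

```python
# Verify the coverage of the 9 combinations selected by the L9(3^3)
# Orthogonal Array (Table 1) inside the 3x3x3 grid of Figure 1.
from itertools import product

# The 9 rows of Table 1: levels of (Factor A, Factor B, Factor C).
selected = {(1, 1, 1), (1, 2, 2), (1, 3, 3),
            (2, 1, 2), (2, 2, 3), (2, 3, 1),
            (3, 1, 3), (3, 2, 1), (3, 3, 2)}

# The 27 axis-parallel grid lines: fix two coordinates, vary the third.
for axis in range(3):
    for fixed in product((1, 2, 3), repeat=2):
        hits = 0
        for level in (1, 2, 3):
            point = list(fixed)
            point.insert(axis, level)
            hits += tuple(point) in selected
        assert hits == 1  # exactly one selected point per grid line
print("each of the 27 grid lines contains exactly one selected combination")
```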
[Fig. 1: Orthogonal Array cube. The red circles are the combinations selected
by the Orthogonal Array; the cube axes are Factor A (A1–A3), Factor B (B1–B3),
and Factor C (C1–C3).]

Table 1: Orthogonal Array with 9 rows and 3 factors, each factor with 3 levels

Row No.  Factor 1  Factor 2  Factor 3
   1        1         1         1
   2        1         2         2
   3        1         3         3
   4        2         1         2
   5        2         2         3
   6        2         3         1
   7        3         1         3
   8        3         2         1
   9        3         3         2
– Step 1: Build the F-L (factor-level) table. Determine the number of to-be-
tuned factors and the number of levels for each factor. The levels should be
chosen based on experience and the literature. We further suppose each factor
has the same number of levels².
– Step 2: Construct the Orthogonal Array Tuning table. The constructed table
should obey the basic composition principles; some commonly used tables are
available online³. Alternatively, the Orthogonal Array Tuning table can be
generated by software such as Weibull++⁴ and SPSS⁵;
² For the sake of simplicity, we consider all the factors with the same number
of levels; see [12] for more complex situations.
³ https://www.york.ac.uk/depts/maths/tables/taguchi_table.htm
⁴ http://www.reliasoft.com/Weibull/index.htm
⁵ https://www.ibm.com/analytics/au/en/technology/spss/
more details are given in this link⁶. The Orthogonal Array Tuning table is
denoted LM(h^k): it has k factors, h levels per factor, and M rows in total.
– Step 3: Run the experiments with the hyper-parameters determined by the
Orthogonal Array Tuning table.
– Step 4: Range analysis. This is the key step of OATM. Based on the ex-
perimental results of the previous step, range analysis is employed to determine
the optimal level and the importance of each factor. The importance of a factor
is defined by its influence on the experimental results. Note that range analysis
optimizes each factor separately and then combines the optimal levels, which
means the optimized hyper-parameter combination is not restricted to the rows
of the Orthogonal Array table.
– Step 5: Run the experiment with the optimal hyper-parameters setting.
OATM optimizes the hyper-parameters by evaluating only a very small set of
highly representative hyper-parameter combinations. Its efficiency can be seen
from the simple example in Figure 1: OATM takes only 9 combinations (red
circles), meaning the hyper-parameters can be optimized by running the experiment
9 times. In contrast, grid search requires trying all 27 combinations (the 27
nodes of the cube). Through OATM, we therefore save about 67% (1 − 9/27 ≈ 0.67)
of the tuning work.
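The five steps above can be sketched as a small Python routine. This is a minimal illustration, not the authors' implementation: the `oatm` function, the `evaluate` callback, and the toy objective below are all hypothetical, and a standard Taguchi L9(3^4) table is assumed.

```python
# A minimal sketch of the five-step OATM procedure (illustration only).

# Standard Taguchi L9(3^4) table: 9 rows, 4 factors, 3 levels (1-indexed).
L9 = [(1, 1, 1, 1), (1, 2, 2, 2), (1, 3, 3, 3),
      (2, 1, 2, 3), (2, 2, 3, 1), (2, 3, 1, 2),
      (3, 1, 3, 2), (3, 2, 1, 3), (3, 3, 2, 1)]

def oatm(fl_table, evaluate):
    """fl_table: {factor: [3 candidate values]} (Step 1);
    evaluate: config dict -> score, e.g. validation accuracy."""
    factors = list(fl_table)
    # Steps 2-3: run only the 9 experiments prescribed by the table.
    scores = [evaluate({f: fl_table[f][lvl - 1]
                        for f, lvl in zip(factors, row)}) for row in L9]
    # Step 4: range analysis -- mean score per level, best level per factor.
    best = {}
    for j, f in enumerate(factors):
        means = [sum(s for row, s in zip(L9, scores) if row[j] == lvl) / 3
                 for lvl in (1, 2, 3)]
        best[f] = fl_table[f][means.index(max(means))]
    return best  # Step 5: run the experiment once more with this combination

# Toy separable objective peaking at lr=0.005 and nl=6 (illustration only).
fl = {"lr": [0.004, 0.005, 0.006], "lam": [0.004, 0.005, 0.008],
      "nl": [4, 5, 6], "nn": [32, 64, 96]}
score = lambda c: -abs(c["lr"] - 0.005) - 0.01 * abs(c["nl"] - 6)
print(oatm(fl, score))
```

Because the L9 table is balanced, each level's mean is taken over an identical mix of the other factors' levels, which is what lets range analysis pick the best level of each factor independently; the combined optimum may lie outside the 9 tested rows.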
4 Experimental Setting
To evaluate the proposed OATM, we design extensive experiments to tune the
hyper-parameters of the two most widely used deep learning structures, i.e.,
Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs).
Both structures are evaluated on three real-world applications: 1) a human
intention recognition task based on Electroencephalography (EEG) signals; 2)
activity recognition based on wearable sensors such as the Inertial Measurement
Unit (IMU); and 3) activity recognition based on pervasive sensors such as
Radio Frequency IDentification (RFID).
In this section, we briefly describe RNN and CNN structures and then introduce
the key hyper-parameters that will be tuned in the experiments.
RNN Structure The RNN [7], one of the most widely used deep neural net-
works, is generally employed to explore feature dependencies over the time
dimension through an internal state of the network. Unlike feed-forward neural
networks, RNNs can use their internal memory to process arbitrary sequences
of inputs and exhibit dynamic temporal behavior. This characteristic enables
RNNs to achieve excellent performance on time-series tasks such as speech
recognition and natural language processing.
The RNN structure used in this paper is shown in Figure 2. In the hidden
layer, to implement the recurrent function, two LSTM (Long Short-Term Memory)
layers are stacked. LSTM is a simple cell structure that can be used to build a
recurrent neural network. Unlike fully connected layers, an LSTM layer is
composed of cells (shown as rectangles) instead of neural nodes (shown as
circles).
In this RNN structure, based on deep learning hyper-parameter tuning experience,
the learning rate, the regularization, and the number of nodes in each hidden
layer are the key factors affecting algorithm performance. The loss is calculated
by the cross-entropy function, the regularization method is the ℓ2 norm with
coefficient λ, and the loss is optimized by the Adam optimizer. In summary, we
choose four factors as to-be-tuned hyper-parameters: the learning rate lr, the
regularization coefficient λ, the number of hidden layers nl, and the number of
nodes⁸ in each hidden layer nn.
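With four factors at three levels each, the run counts can be checked against the #-Runnings column of Table 2: exhaustive grid search needs 3⁴ = 81 runs, while OATM needs only the 9 rows of an L9 table. A quick sanity check:

```python
# Run counts for four factors with three levels each (cf. Table 2).
k, h = 4, 3
grid_runs = h ** k          # exhaustive grid search over all combinations
oatm_runs = 9               # rows of the L9(3^4) Orthogonal Array table
saving = 1 - oatm_runs / grid_runs
print(grid_runs, oatm_runs, round(saving, 3))
```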
CNN Structure The CNN is another popular deep neural network, which
shows strong ability to capture the latent spatial relevance of the input data
⁸ Assume all the hidden layers have the same fixed number of nodes.
Fig. 2: The schematic diagram of RNN structure. ‘H’ denotes Hidden, where, for
example, the H 1 layer denotes the first hidden layer.
and has been demonstrated in a wide range of research topics such as com-
puter vision [6]. The CNN structure contains three categories of components:
the convolutional layer, the pooling layer, and the fully connected layer. Each
component may appear one or multiple times in a CNN.
As shown in Figure 3, the schematic diagram of CNN is stacked in the fol-
lowing order: the input layer, the first convolutional layer, the first pooling layer,
the second convolutional layer, the second pooling layer, the first fully connected
layer, the second fully connected layer, and the output layer. The loss function,
regularization method, and optimizer are the same as those in the RNN structure.
Based on CNN hyper-parameter tuning experience, we choose the four most crucial
factors to be tuned by OATM: the learning rate lr′, the filter size f′, the
number of convolutional and pooling layers n′l⁹, and the number of nodes n′n
in the second fully connected layer.
The code is open-sourced; please check the code for training details that are
not presented here due to the page limitation.
Table 2: Comparison with the state-of-the-art methods over three datasets and
two deep learning architectures. The F1 ∼ F4 represent four tuning factors. Acc,
Prec and F-1 denote accuracy, precision and F-1 score, respectively.
Data  Model  Method   F1      F2      F3   F4   #-Runnings  Time (s)  Acc     Prec    Recall  F-1
EEG   RNN    Grid     0.005   0.004   6    64   81          6853.6    0.9251  0.9324  0.9139  0.9231
             Random   0.01    0.008   6    32   9           766.8     0.7941  0.8003  0.7941  0.7947
             BO       0.0135  0.0049  5    32   9           703.4     0.718   0.7246  0.6474  0.6838
             Ours     0.005   0.004   6    64   9           821.9     0.925   0.9335  0.9223  0.9279
      CNN    Grid     0.005   4       3    192  81          31891.5   0.828   0.8137  0.8256  0.8269
             Random   0.003   2       1    128  9           662.8     0.7268  0.7277  0.7269  0.7266
             BO       0.001   4       3    139  9           721.9     0.7244  0.7302  0.7244  0.7263
             Ours     0.003   4       1    128  9           680.4     0.797   0.7969  0.8112  0.8003
IMU   RNN    Grid     0.005   0.004   6    96   81          3027.2    0.9936  0.9909  0.9976  0.9971
             Random   0.015   0.004   4    32   9           1008.5    0.9139  0.9209  0.9412  0.9156
             BO       0.0132  0.0041  4    48   9           1078.8    0.9872  0.9877  0.9851  0.9863
             Ours     0.005   0.004   6    64   9           1138.2    0.9913  0.9924  0.9905  0.9919
      CNN    Grid     0.003   2       1    128  81          41804.9   0.9732  0.9708  0.9708  0.9707
             Random   0.003   2       2    128  9           7089.2    0.9692  0.9691  0.9692  0.9691
             BO       0.0012  2       2    192  9           6559.7    0.9696  0.9702  0.9701  0.9701
             Ours     0.003   2       2    128  9           6809.8    0.9702  0.9699  0.9703  0.9702
RFID  RNN    Grid     0.005   0.008   6    96   81          2846.1    0.9342  0.9388  0.9201  0.9252
             Random   0.005   0.012   4    32   9           642.3     0.8891  0.9138  0.8826  0.8895
             BO       0.0142  0.0093  6    79   9           452.2     0.9071  0.8556  0.8486  0.8436
             Ours     0.005   0.008   6    64   9           497.1     0.9134  0.9138  0.9029  0.9162
      CNN    Grid     0.005   4       2    192  81          7890.8    0.9316  0.9513  0.9316  0.9375
             Random   0.005   2       1    128  9           1210.3    0.8683  0.9113  0.8684  0.8779
             BO       0.005   5       3    64   9           872.9     0.9168  0.9058  0.9194  0.9086
             Ours     0.005   4       3    192  9           980.3     0.9235  0.9316  0.9188  0.9326
Step 1: Build the F-L table According to the description in Section 4.2,
OATM works on four different hyper-parameters (factors): the learning rate
lr, the ℓ2 norm coefficient λ, the number of hidden layers nl, and the number
of nodes nn. The number of levels h is set to 3, though it could be much larger
in real-world applications. Based on the related work and tuning experience [14],
the empirical values are shown in Table 3.
Step 2: OATM table We then choose a suitable Orthogonal Array table with
4 factors and 3 levels from this link¹⁰, which contains 9 combinations. The
OATM table satisfies two basic principles: i) in each column, the different
levels appear the same number of times; ii) any two columns together contain
all nine ordered level combinations, completely and in balance.
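Both principles can be verified programmatically. The snippet below assumes the standard Taguchi L9(3^4) table (the specific table used in the experiments is only available via the link above):

```python
# Check the two composition principles on the standard Taguchi L9(3^4) table.
from itertools import combinations

L9 = [(1, 1, 1, 1), (1, 2, 2, 2), (1, 3, 3, 3),
      (2, 1, 2, 3), (2, 2, 3, 1), (2, 3, 1, 2),
      (3, 1, 3, 2), (3, 2, 1, 3), (3, 3, 2, 1)]

# Principle i: within each column, every level appears equally often (3 times).
for j in range(4):
    column = [row[j] for row in L9]
    assert all(column.count(level) == 3 for level in (1, 2, 3))

# Principle ii: any two columns together contain all 9 ordered level pairs.
for a, b in combinations(range(4), 2):
    assert len({(row[a], row[b]) for row in L9}) == 9

print("both composition principles hold")
```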
Step 3: Run the experiments Following the OATM table, we run the 9 experi-
ments and record the classification accuracy. In our case, each experiment runs
5 times and the corresponding average accuracy is recorded. Each experiment is
trained for 1,000 iterations to guarantee convergence.
Step 4: Range analysis This is the key step of Orthogonal Array Tuning. The
overall range analysis procedure and results are shown in Table 4. The first 9
rows are measured and recorded in Step 3. Rleveli denotes the sum of the accuracies
¹⁰ https://www.york.ac.uk/depts/maths/tables/taguchi_table.htm
under level i. For example, Rlevel1 for factor 1 is the sum of the accuracies of
the first 3 rows (2.196 = 0.875 + 0.8 + 0.521), where factor 1 is at level 1.
Aleveli denotes the average accuracy of level i, calculated as Aleveli =
Rleveli / h. In the above example, Alevel1 = 2.196 / 3 = 0.732. The lowest and
highest accuracies, i.e., the minimum and maximum of Aleveli respectively, are
used to calculate the range of Aleveli. The importance denotes how influential
a factor is and is ranked by the range value. The best level is the optimal
level selected based on the highest accuracy, while the optimal value is the
corresponding hyper-parameter value of the best level.
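The arithmetic of this step can be reproduced directly; the three accuracies below are the first-row values quoted above for factor 1 at level 1.

```python
# Reproduce the worked range-analysis numbers for factor 1 at level 1:
# the first three experiment rows have accuracies 0.875, 0.800, 0.521.
accs = [0.875, 0.800, 0.521]
h = 3                       # number of levels per factor
R_level1 = sum(accs)        # sum of accuracies under level 1 -> 2.196
A_level1 = R_level1 / h     # average accuracy of level 1 -> 0.732
print(round(R_level1, 3), round(A_level1, 3))
```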
Step 5: Run the optimal setting Given the best levels from the range analysis
in the previous step, we run the experiment with the optimal hyper-parameters
(lr = 0.004, λ = 0.005, nl = 6, and nn = 64) and obtain an optimal accuracy of
0.925. We can observe that:
– The optimal accuracy of 0.925 is higher than the maximum accuracy (0.897)
obtained in the OATM experiments, which demonstrates that OATM approximates
the global optimum rather than a local optimum.
– The importance of each factor is ranked through the range analysis: λ >
nl > lr > nn, which can help researchers grasp the dominant variables in the
RNN structure and guide future development.
The OATM procedure for CNN is similar to that for RNN. Here, we only report
the F-L table (Table 3) and the range analysis table (Table 5).
References
1. Andradóttir, S.: A review of random search methods. In: Handbook of Simulation
Optimization, pp. 277–292. Springer (2015)
2. Bergstra, J.S., Bardenet, R., Bengio, Y., Kégl, B.: Algorithms for hyper-parameter
optimization. In: Advances in Neural Information Processing Systems 24, pp. 2546–
2554 (2011)
3. Calandra, R., Gopalan, N., Seyfarth, A., Peters, J., Deisenroth, M.P.: Bayesian gait
optimization for bipedal locomotion. In: Learning and Intelligent Optimization. pp.
274–290 (2014)
4. Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., Hutter, F.:
Efficient and robust automated machine learning. In: Advances in Neural Informa-
tion Processing Systems 28 (2015)
5. Fida, B., Bibbo, D., Bernabucci, I., et al.: Real time event-based segmentation
to classify locomotion activities through a single inertial sensor. In: Proceedings
of the 5th EAI International Conference on Wireless Mobile Communication and
Healthcare. pp. 104–107 (2015)
6. Gu, J., Wang, Z., Kuen, J., Ma, L., Shahroudy, A., Shuai, B., Liu, T., Wang,
X., Wang, G., Cai, J., et al.: Recent advances in convolutional neural networks.
Pattern Recognition 77, 354–377 (2018)
7. Li, K., Xu, H., Wang, Y., Povey, D., Khudanpur, S.: Recurrent neural network
language model adaptation for conversational speech recognition. INTERSPEECH,
Hyderabad pp. 1–5 (2018)
8. Mahapatra, S., Patnaik, A.: Optimization of wire electrical discharge machining
(WEDM) process parameters using Taguchi method. The International Journal of
Advanced Manufacturing Technology 34(9), 911–925 (2007)
9. Nalbant, M., Gökkaya, H., Sur, G.: Application of Taguchi method in the optimiza-
tion of cutting parameters for surface roughness in turning. Materials & Design
28(4), 1379–1385 (2007)
10. Snoek, J., Larochelle, H., Adams, R.P.: Practical Bayesian optimization of machine
learning algorithms. In: Advances in Neural Information Processing Systems 25,
pp. 2951–2959. Curran Associates, Inc. (2012)
11. Swersky, K., Snoek, J., Adams, R.P.: Multi-task Bayesian optimization. In: Ad-
vances in Neural Information Processing Systems 26, pp. 2004–2012 (2013)
12. Taguchi, G., Taguchi, G.: System of experimental design; engineering methods to
optimize quality and minimize costs. Tech. rep. (1987)
13. Yao, L., Sheng, Q.Z., Li, X., Gu, T., Tan, M., Wang, X., Wang, S., Ruan, W.:
Compressive representation for device-free activity recognition with passive rfid
signal strength. IEEE Transactions on Mobile Computing 17(2), 293–306 (2017)
14. Zhang, X., Yao, L., Huang, C., Sheng, Q.Z., Wang, X.: Intent recognition in smart
living through deep recurrent neural networks. In: International Conference on
Neural Information Processing (ICONIP). pp. 748–758. Springer (2017)
15. Zhang, X., Yao, L., Sheng, Q.Z., Kanhere, S.S., Gu, T., Zhang, D.: Converting your
thoughts to texts: Enabling brain typing via deep feature learning of eeg signals
(2018)