
Deep Neural Network Hyperparameter Optimization with Orthogonal Array Tuning

Xiang Zhang, Xiaocong Chen, Lina Yao, Chang Ge, Manqing Dong
University of New South Wales, Australia
[email protected]

arXiv:1907.13359v2 [cs.LG] 28 Feb 2020

Abstract. Deep learning algorithms have recently achieved excellent performance in a wide range of fields (e.g., computer vision). However, a severe challenge faced by deep learning is its high dependency on hyper-parameters: the results of an algorithm may fluctuate dramatically under different hyper-parameter configurations. To address this issue, this paper presents an efficient Orthogonal Array Tuning Method (OATM) for deep learning hyper-parameter tuning. We describe the OATM approach in five detailed steps and elaborate on it using two widely used deep neural network structures (Recurrent Neural Networks and Convolutional Neural Networks). The proposed method is compared to state-of-the-art hyper-parameter tuning methods, both manual (e.g., grid search and random search) and automated (e.g., Bayesian Optimization). The experimental results show that OATM can significantly reduce tuning time compared to the state-of-the-art methods while preserving satisfying performance.

Keywords: orthogonal array, hyper-parameter, deep learning

1 Introduction
Deep learning has recently been attracting much attention in both academia and industry, due to its excellent performance in various research areas such as computer vision, speech recognition, natural language processing, and brain-computer interfaces [15]. Nevertheless, deep learning faces an important challenge: the performance of an algorithm depends heavily on the selection of hyper-parameters. Compared with traditional machine learning algorithms, deep learning requires hyper-parameter tuning more urgently because deep neural networks 1) have more hyper-parameters to be tuned and 2) have a higher dependency on the configuration of their hyper-parameters. [14] reports that deep learning classification accuracy fluctuates dramatically, from 32.2% to 92.6%, depending on the selection of hyper-parameters. Therefore, an effective and efficient hyper-parameter tuning method is necessary.
However, most of the existing hyper-parameter tuning methods have drawbacks. In particular, grid search traverses all the possible combinations of the different hyper-parameters, which is a time-consuming and ad-hoc process [2].
Random search, which is developed based on grid search, sets up a grid of hyper-parameter values and selects random combinations to train the algorithm [2]. Random search overcomes some disadvantages of grid search, such as its high time consumption, but it brings a major disadvantage of its own: it cannot be guaranteed to converge to the global optimum [1]. The randomly selected hyper-parameter combinations cannot guarantee a steady and competitive result. Apart from these manual tuning methods, automated tuning methods have become more popular in recent years [10]. Bayesian Optimization, the most widely used automated hyper-parameter tuning approach, attempts to find the global optimum in a minimum number of steps. Nevertheless, the results of Bayesian Optimization are sensitive to the parameters of the surrogate model, and its performance depends heavily on the quality of the learning model [3].
To address the aforementioned issues, we propose the Orthogonal Array Tuning Method (OATM), which achieves a trade-off between reduced tuning time and competitive performance. OATM is based on the Taguchi Approach [12]. It is a highly fractional orthogonal design method that relies on a design matrix and allows the user to consider a selected subset of combinations of multiple factors at multiple levels. Additionally, OATM is balanced, ensuring that all possible values of all hyper-parameters are considered equally. Moreover, OATM has been commonly used as an experimental design method in a wide variety of domains such as mechanical engineering [9] and electrical engineering [8]. To the best of our knowledge, our work is among the first to adopt orthogonal arrays for hyper-parameter tuning in deep learning.
The proposed OATM adopts the orthogonal array to extract the most representative and balanced combinations from the whole set of possible combinations. OATM will be explained in detail in the context of two popular deep learning structures (Section 5). In addition, OATM is evaluated over three datasets, which demonstrates its universality and adaptability. We notice that source code performing grid search, random search, and especially Bayesian Optimization on deep learning is hard to acquire online. Thus, we provide reusable source code and datasets for reproduction (the link will be made available after the paper is accepted).

2 Related Work
Currently, there are several widely used tuning methods, such as grid search, random search, and Bayesian Optimization. Grid search and random search require all possible values for each parameter, whereas Bayesian Optimization only needs a range for each parameter. [4] proposed an automated machine learning method based on the efficiency of Bayesian Optimization, and [11] applied multi-task Gaussian processes to Bayesian Optimization to enhance its performance. However, these methods fall short on deep learning architectures, which have a larger number of hyper-parameters and whose performance relies heavily on their configuration.
Apart from the aforementioned methods, orthogonal array based hyper-parameter tuning has already been used in a range of research areas such as mechanical engineering and electrical engineering. J. A. Ghani et al. [9] applied an orthogonal array based approach to optimize the cutting parameters in end milling. S. S. Mahapatra et al. [8] optimized wire electrical discharge machining (WEDM) process parameters with an orthogonal array method.
Summary. The traditional methods are not well suited for deep learning algorithms, while the effectiveness of OATM has been demonstrated in many research topics. We therefore adopt OATM for deep learning hyper-parameter tuning. To the best of our knowledge, our work is among the first studies in this area.

3 Orthogonal Array Tuning


In this section, we first provide the background knowledge of orthogonal arrays, namely the definition, the composition principles, and the terminology. Then, we describe the working procedure of OATM.

3.1 Background of Orthogonal Array


An Orthogonal Array is a table/array whose entries come from a fixed finite set of elements (typically {1, 2, ..., n}), arranged in such a way that for every selection of two different columns of the table, all ordered 2-tuples of elements appear the same number of times. For example, Table 1 shows an Orthogonal Array whose entries come from the fixed finite set {1, 2, 3}. In an Orthogonal Array, each column is called a factor and each element of the finite set (i.e., each value a column can take) is called a level.
The Orthogonal Array holds two basic composition principles:
– First, in the same column (factor), different levels appear the same number of times. For example, in the first column of Table 1, each level (level 1, level 2, and level 3) appears 3 times. Similarly, in the second and third columns, each level appears 3 times.
– Second, in any two selected columns (factors), the different level combinations are complete and balanced. The number of rows of the Orthogonal Array is determined by this principle. For example, in the first and second columns of Table 1, each column has 3 levels and there are in total 9 different ordered combinations: (1,1), (1,2), (1,3), (2,1), (2,2), (2,3), (3,1), (3,2), and (3,3). All the combinations are complete (every combination appears) and balanced (every combination appears exactly once).
The essence of an Orthogonal Array is a representative subset of the exhaustive full set of combinations. We denote the exhaustive combination of all the factors and all the levels (3 factors and 3 levels in the above example) as S. Apparently, Card(S) = 3^3 = 27. As shown in Table 1, the Orthogonal Array has only 9 rows. Let Card(O) = 9, where O denotes the set of combinations in the OATM. Clearly, O is a representative subset of S, i.e., O ⊆ S. Intuitively, we can draw both sets in a cube. In Figure 1, A1, A2, A3 represent the 3 levels of factor A, and factors B and C are denoted with the same notation (the factors are assumed to be statistically independent of each other). The 27 nodes of the cube denote S, while the 9 circled nodes represent the 9 combinations in O. It is easy to observe in Figure 1 that the combinations (circled nodes) sampled by OATM are uniformly distributed: each axis-parallel line of the 3×3×3 lattice (27 lines in total) contains exactly one circled node and each axis-aligned plane (9 planes in total) contains three circled nodes.

Fig. 1: Orthogonal Array cube. The axes correspond to Factor A (levels A1–A3), Factor B (B1–B3), and Factor C (C1–C3); the red circles are the combinations selected by the Orthogonal Array.

Table 1: Orthogonal Array with 9 rows and 3 factors, each factor having 3 levels

Row No. | Factor 1 | Factor 2 | Factor 3
   1    |    1     |    1     |    1
   2    |    1     |    2     |    2
   3    |    1     |    3     |    3
   4    |    2     |    1     |    2
   5    |    2     |    2     |    3
   6    |    2     |    3     |    1
   7    |    3     |    1     |    3
   8    |    3     |    2     |    1
   9    |    3     |    3     |    2

3.2 Orthogonal Array Tuning Method

In this section, we propose the Orthogonal Array Tuning Method, inspired by the basic principles of orthogonal arrays. Although deep learning algorithms can achieve good performance in many research areas, tuning the hyper-parameters (e.g., the number of layers, the number of nodes in each layer, and the learning rate) is time-consuming and dependent on the user's expertise.
In OATM, the hyper-parameters are regarded as factors and the different values of each hyper-parameter are regarded as levels. The procedure is listed as follows.

– Step 1: Build the F-L (factor-level) table. Determine the number of to-be-tuned factors and the number of levels for each factor. The levels should be determined by experience and the literature. For simplicity, we further suppose that each factor has the same number of levels (see [12] for more complex situations with unequal numbers of levels).
– Step 2: Construct the Orthogonal Array Tuning table. The constructed table should obey the basic composition principles. Commonly used tables are available at https://www.york.ac.uk/depts/maths/tables/taguchi_table.htm. Alternatively, the Orthogonal Array Tuning table can be generated by software such as Weibull++ (http://www.reliasoft.com/Weibull/index.htm) and SPSS (https://www.ibm.com/analytics/au/en/technology/spss/); a tutorial is available at https://www.youtube.com/watch?v=C7PIcOXlWQg. The Orthogonal Array Tuning table is denoted L_M(h^k), which has k factors, h levels, and M rows in total.
– Step 3: Run the experiments with the hyper-parameters determined by the Orthogonal Array Tuning table.
– Step 4: Range analysis. This is the key step of OATM. Based on the experiment results of the previous step, the range analysis method is employed to analyze the results and figure out the optimal level and the importance of each factor. The importance of a factor is defined by its influence on the results of the experiments. Note that range analysis optimizes each factor separately and then combines the optimal levels, which means that the optimized hyper-parameter combination is not restricted to the rows of the existing Orthogonal Array table.
– Step 5: Run the experiment with the optimal hyper-parameter setting.
OATM optimizes the hyper-parameters by utilizing a very small set of highly representative hyper-parameter combinations. Its high efficiency can be demonstrated by the simple example in Figure 1. OATM only takes 9 combinations (red circles), which means the hyper-parameters can be optimized by running the experiment 9 times. In contrast, grid search requires trying all 27 combinations (the 27 nodes of the cube). Therefore, OATM saves about 67% (1 − 9/27 ≈ 0.67) of the work in the tuning procedure.
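
To make Steps 2–5 concrete, the following Python sketch runs a black-box evaluation function over the standard L9(3^4) Orthogonal Array Tuning table and then performs the range analysis of Step 4. The evaluate function is a placeholder for training and testing a network; the factor values shown are those of the RNN case study in Section 5.2 (Table 3), and everything else is our own illustrative scaffolding rather than the authors' released code.

from statistics import mean

# Standard L9(3^4) Orthogonal Array Tuning table: 9 rows, 4 factors at 3 levels
# (level indices 1..3). Each row is one hyper-parameter combination to run.
L9_TABLE = [
    (1, 1, 1, 1), (1, 2, 2, 2), (1, 3, 3, 3),
    (2, 1, 2, 3), (2, 2, 3, 1), (2, 3, 1, 2),
    (3, 1, 3, 2), (3, 2, 1, 3), (3, 3, 2, 1),
]

# Step 1: F-L table (here: the RNN values of Table 3 in Section 5.2).
FL_TABLE = {
    "lr":      [0.005, 0.01, 0.015],   # learning rate
    "lambda":  [0.004, 0.008, 0.012],  # L2 regularization coefficient
    "n_layer": [4, 5, 6],              # number of hidden layers
    "n_node":  [32, 64, 96],           # nodes per hidden layer
}
FACTORS = list(FL_TABLE)

def evaluate(config):
    """Placeholder: train a network with `config` and return its test accuracy."""
    raise NotImplementedError

def oatm(evaluate_fn):
    # Step 3: run the 9 experiments prescribed by the orthogonal array.
    results = []
    for row in L9_TABLE:
        config = {f: FL_TABLE[f][lvl - 1] for f, lvl in zip(FACTORS, row)}
        results.append((row, evaluate_fn(config)))

    # Step 4: range analysis. For each factor, average the accuracy per level;
    # the best levels are combined across factors, and the range of the
    # averages ranks the factors by importance.
    best, ranges = {}, {}
    for j, factor in enumerate(FACTORS):
        avg = {lvl: mean(acc for r, acc in results if r[j] == lvl)
               for lvl in (1, 2, 3)}
        best[factor] = FL_TABLE[factor][max(avg, key=avg.get) - 1]
        ranges[factor] = max(avg.values()) - min(avg.values())
    importance = sorted(FACTORS, key=ranges.get, reverse=True)

    # Step 5: run once more with the combined optimal setting.
    return best, importance, evaluate_fn(best)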

4 Experimental Setting
To evaluate the proposed OATM, we design extensive experiments to tune the hyper-parameters of the two most widely used deep learning structures, i.e., the Recurrent Neural Network (RNN) and the Convolutional Neural Network (CNN). Both deep learning structures are employed on three real-world applications: 1) a human intention recognition task based on Electroencephalography (EEG) signals; 2) activity recognition based on wearable sensors such as Inertial Measurement Units (IMUs); 3) activity recognition based on pervasive sensors such as Radio Frequency IDentification (RFID).

4.1 Data Setting


The proposed OATM is evaluated over three different tasks on three benchmark
datasets. Each dataset is divided into a training set (80%) and a testing set
(20%).
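
The 80%/20% split can be reproduced with a standard utility; a minimal sketch is shown below, with placeholder arrays standing in for one of the datasets (the actual loading code and random seed are not specified in the paper).

import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays standing in for one dataset
# (e.g., eegmmidb: 28,000 samples x 64 channels).
X = np.zeros((28000, 64))
y = np.zeros(28000, dtype=int)

# 80% training / 20% testing split, as used for all three datasets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)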

EEG-based Intention Recognition. We select the widely used EEG dataset from the PhysioNet eegmmidb database (https://www.physionet.org/pn4/EEGmmidb/), which contains 5 different categories. In this paper, we choose a subset of eegmmidb which contains 28,000 EEG samples. Every sample is a vector with 64 elements corresponding to 64 channels.
IMU-based Activity Recognition. This dataset was collected from 9 participants [5] and contains 1,200,000 samples. 8 ADLs are selected as the subset used in this paper. The activity is measured by 3 IMUs, and each IMU collects a sensor signal with 14 dimensions including two 3-axis accelerometers, one 3-axis gyroscope, one 3-axis magnetometer, and one thermometer.

RFID-based Activity Recognition. We collect the signals from passive RFID tags [13] and have 3,100 samples in total. 21 activities, including 18 ADLs (Activities of Daily Living) and 3 abnormal falls, are performed by 6 subjects. Each sample has 12 dimensions corresponding to 12 RFID tags. The RSSI (Received Signal Strength Indicator) measures the power present in a received radio signal, which is a convenient environmental measurement in ubiquitous computing.

4.2 Deep Learning Structures

In this section, we briefly describe RNN and CNN structures and then introduce
the key hyper-parameters that will be tuned in the experiments.

RNN Structure The RNN [7], one of the most widely used deep neural networks, is generally employed to explore feature dependencies over the time dimension through an internal state of the network. Unlike feed-forward neural networks, RNNs can use their internal memory to process arbitrary sequences of inputs and exhibit dynamic temporal behavior. This characteristic enables RNNs to achieve excellent performance in time-series tasks such as speech recognition and natural language processing.
The RNN structure used in this paper is shown in Figure 2. In the hidden layers, to implement the recurrent function, two LSTM (Long Short-Term Memory) layers are stacked. LSTM is a simple cell structure which can be used to build a recurrent neural network. Different from other fully connected layers, an LSTM layer is composed of cells (shown as rectangles) instead of neural nodes (shown as circles).
In this RNN structure, based on deep learning hyper-parameter tuning experience, the learning rate, the regularization, and the number of nodes in each hidden layer are the key factors affecting the algorithm performance. The loss is calculated by the cross-entropy function, the regularization method is the l2 norm with coefficient λ, and the loss is optimized by the Adam optimizer. In summary, we choose four factors as to-be-tuned hyper-parameters: the learning rate lr, the regularization coefficient λ, the number of hidden layers n_l, and the number of nodes in each hidden layer n_n (we assume all hidden layers have the same fixed number of nodes).
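
To illustrate how the four factors enter the model, the PyTorch sketch below builds a stacked-LSTM classifier parameterized by lr, λ, n_l, and n_n. It is our own minimal re-implementation under assumed shapes (64 input channels and 5 classes, as in the EEG task), not the authors' released code, and λ is realized as L2-style weight decay in Adam.

import torch
import torch.nn as nn

class RNNClassifier(nn.Module):
    """Stacked-LSTM classifier whose shape is set by the tuned factors."""

    def __init__(self, n_features, n_classes, n_l=6, n_n=64):
        super().__init__()
        # n_l stacked LSTM layers, each with n_n hidden nodes.
        self.lstm = nn.LSTM(input_size=n_features, hidden_size=n_n,
                            num_layers=n_l, batch_first=True)
        self.out = nn.Linear(n_n, n_classes)

    def forward(self, x):            # x: (batch, time, n_features)
        h, _ = self.lstm(x)
        return self.out(h[:, -1])    # classify from the last time step

# Example: EEG setting with the optimal RNN factors reported in Section 5.
model = RNNClassifier(n_features=64, n_classes=5, n_l=6, n_n=64)
criterion = nn.CrossEntropyLoss()                            # cross-entropy loss
optimizer = torch.optim.Adam(model.parameters(),
                             lr=0.005, weight_decay=0.004)   # lr and lambda (L2)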

Fig. 2: The schematic diagram of the RNN structure. 'H' denotes hidden; for example, the H1 layer denotes the first hidden layer.

Fig. 3: The schematic diagram of the CNN structure. C, P, and FC denote the convolutional layer, pooling layer, and fully connected layer, respectively.

CNN Structure The CNN is another popular deep neural network, which shows a strong ability to capture the latent spatial relevance of the input data and has been demonstrated in a wide range of research topics such as computer vision [6]. The CNN structure contains three categories of components: the convolutional layer, the pooling layer, and the fully connected layer. Each component may appear one or multiple times in a CNN.
As shown in Figure 3, the schematic diagram of the CNN is stacked in the following order: the input layer, the first convolutional layer, the first pooling layer, the second convolutional layer, the second pooling layer, the first fully connected layer, the second fully connected layer, and the output layer. The loss function, regularization method, and optimizer are the same as those in the RNN structure. Based on hyper-parameter tuning experience with CNNs, we choose the four most crucial factors to be tuned by OATM: the learning rate lr', the filter size f', the number of convolutional and pooling layers n_l' (each convolutional layer and the following pooling layer are considered as a whole), and the number of nodes n_n' in the second fully connected layer.
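
Analogously, a minimal PyTorch sketch of the CNN structure is given below. The channel count (16), the size of the first fully connected layer (256), and the way the filter size f' and block count n_l' are wired in are our own assumptions for illustration, not the authors' exact configuration.

import torch
import torch.nn as nn

class CNNClassifier(nn.Module):
    """CNN with n_l' (convolution + pooling) blocks and two fully connected layers."""

    def __init__(self, n_classes, f_size=(1, 6), n_l=1, n_n=128):
        super().__init__()
        blocks, in_ch = [], 1
        for _ in range(n_l):
            # One convolutional layer followed by one pooling layer per block.
            blocks += [nn.Conv2d(in_ch, 16, kernel_size=f_size, padding="same"),
                       nn.ReLU(),
                       nn.MaxPool2d(kernel_size=(1, 2))]
            in_ch = 16
        self.features = nn.Sequential(*blocks)
        self.fc1 = nn.LazyLinear(256)     # first fully connected layer (size assumed)
        self.fc2 = nn.Linear(256, n_n)    # second fully connected layer: n_n' nodes
        self.out = nn.Linear(n_n, n_classes)

    def forward(self, x):                 # x: (batch, 1, height, width)
        h = torch.flatten(self.features(x), start_dim=1)
        h = torch.relu(self.fc1(h))
        h = torch.relu(self.fc2(h))
        return self.out(h)

# Example: EEG setting with the optimal CNN factors reported in Table 5.
model = CNNClassifier(n_classes=5, f_size=(1, 6), n_l=1, n_n=128)
optimizer = torch.optim.Adam(model.parameters(), lr=0.003)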

5 Results and Analysis

In this section, we present the hyper-parameter tuning results of OATM and compare them to the state-of-the-art methods over a very comprehensive scenario covering two deep learning structures working on three datasets. For simplicity, we set the same hyper-parameter ranges for the three datasets. All the code is open-sourced; please check the code for the training details which are not presented here due to the page limitation.

5.1 Overall Comparison


In this section, we compare the proposed OATM with the most competitive state-of-the-art hyper-parameter tuning approaches, including two manual methods (grid search and random search) and an automated one (Bayesian Optimization). It is easy to compute that there are 3^4 = 81 exhaustive combinations in grid search, since we have four hyper-parameter factors with three levels each. Thus, grid search requires 81 runs to find the optimal hyper-parameters. In contrast, our method requires only the 9 runs described in the corresponding orthogonal array table (detailed in Section 5.2). Since the numbers of runs in random search and Bayesian Optimization are set manually, they are set to 9 runs, the same as our method, in order to keep the comparison fair. The baselines are introduced here:
– Grid search simply goes through all the possible combinations according to the values provided, which is exhaustive [2].
– Random search randomly picks combinations from all possible ones. It may not find a decent combination but is widely adopted in industry for its high efficiency [1].
– Bayesian Optimization uses a Gaussian process to minimize the loss function in order to maximize performance [10].
The hyper-parameter levels are selected based on empirical values. For grid search, random search, and our OATM, the empirical values are discrete, as listed in Table 3 (taking eegmmidb as an example). For Bayesian Optimization, each hyper-parameter ranges between the minimum and maximum of the corresponding factor; for instance, lr ranges over [0.005, 0.015]. All the experiments are implemented on an NVIDIA Titan X (Pascal) GPU and each reported value is the average of five runs under the same setting.
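
For reference, one way to set up such a Gaussian-process Bayesian Optimization baseline with a budget of 9 evaluations is via scikit-optimize. This is our own sketch of the setting described above (continuous ranges spanning each factor's minimum and maximum), not necessarily the authors' implementation; the objective here returns a random dummy score so the sketch runs end-to-end and must be replaced with real model training.

import random
from skopt import gp_minimize
from skopt.space import Integer, Real

def train_and_evaluate(params):
    """Placeholder objective: train the RNN with `params` and return 1 - accuracy."""
    lr, lam, n_layer, n_node = params
    return random.random()   # dummy value; replace with actual training/testing

search_space = [
    Real(0.005, 0.015, name="lr"),       # learning rate
    Real(0.004, 0.012, name="lambda"),   # L2 regularization coefficient
    Integer(4, 6, name="n_layer"),       # number of hidden layers
    Integer(32, 96, name="n_node"),      # nodes per hidden layer
]

# Gaussian-process Bayesian Optimization with 9 evaluations, matching the 9 OATM runs.
result = gp_minimize(train_and_evaluate, search_space,
                     n_calls=9, n_initial_points=3, random_state=0)
print(result.x, 1 - result.fun)          # best hyper-parameters and their accuracy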
The comparison results are shown in Table 2. It can be observed that:
– under the same number of runs (9 runs), our method outperforms random search and Bayesian Optimization over all the datasets and deep learning architectures;
– our method performs slightly worse than grid search but is still competitive; moreover, taking the EEG dataset with RNN as an example, our approach saves 88% of the tuning time, since OATM only requires 9 runs and costs 821.9 s while grid search requires 81 runs and 6853.6 s;
– the optimal factors selected by our method approximate the global optimal factors selected by grid search.

5.2 Case Study in RNN and CNN


In this section, we take the EEG classification as an example to present the detailed procedure of OATM for the RNN and CNN architectures. The overall paradigm can be divided into five steps.

Table 2: Comparison with the state-of-the-art methods over three datasets and two deep learning architectures. F1–F4 represent the four tuning factors. Acc, Prec, and F-1 denote accuracy, precision, and F-1 score, respectively.

Data | Model | Method | F1     | F2     | F3 | F4  | #-Runs | Time (s) | Acc    | Prec   | Recall | F-1
EEG  | RNN   | Grid   | 0.005  | 0.004  | 6  | 64  | 81     | 6853.6   | 0.9251 | 0.9324 | 0.9139 | 0.9231
EEG  | RNN   | Random | 0.01   | 0.008  | 6  | 32  | 9      | 766.8    | 0.7941 | 0.8003 | 0.7941 | 0.7947
EEG  | RNN   | BO     | 0.0135 | 0.0049 | 5  | 32  | 9      | 703.4    | 0.718  | 0.7246 | 0.6474 | 0.6838
EEG  | RNN   | Ours   | 0.005  | 0.004  | 6  | 64  | 9      | 821.9    | 0.925  | 0.9335 | 0.9223 | 0.9279
EEG  | CNN   | Grid   | 0.005  | 4      | 3  | 192 | 81     | 31891.5  | 0.828  | 0.8137 | 0.8256 | 0.8269
EEG  | CNN   | Random | 0.003  | 2      | 1  | 128 | 9      | 662.8    | 0.7268 | 0.7277 | 0.7269 | 0.7266
EEG  | CNN   | BO     | 0.001  | 4      | 3  | 139 | 9      | 721.9    | 0.7244 | 0.7302 | 0.7244 | 0.7263
EEG  | CNN   | Ours   | 0.003  | 4      | 1  | 128 | 9      | 680.4    | 0.797  | 0.7969 | 0.8112 | 0.8003
IMU  | RNN   | Grid   | 0.005  | 0.004  | 6  | 96  | 81     | 3027.2   | 0.9936 | 0.9909 | 0.9976 | 0.9971
IMU  | RNN   | Random | 0.015  | 0.004  | 4  | 32  | 9      | 1008.5   | 0.9139 | 0.9209 | 0.9412 | 0.9156
IMU  | RNN   | BO     | 0.0132 | 0.0041 | 4  | 48  | 9      | 1078.8   | 0.9872 | 0.9877 | 0.9851 | 0.9863
IMU  | RNN   | Ours   | 0.005  | 0.004  | 6  | 64  | 9      | 1138.2   | 0.9913 | 0.9924 | 0.9905 | 0.9919
IMU  | CNN   | Grid   | 0.003  | 2      | 1  | 128 | 81     | 41804.9  | 0.9732 | 0.9708 | 0.9708 | 0.9707
IMU  | CNN   | Random | 0.003  | 2      | 2  | 128 | 9      | 7089.2   | 0.9692 | 0.9691 | 0.9692 | 0.9691
IMU  | CNN   | BO     | 0.0012 | 2      | 2  | 192 | 9      | 6559.7   | 0.9696 | 0.9702 | 0.9701 | 0.9701
IMU  | CNN   | Ours   | 0.003  | 2      | 2  | 128 | 9      | 6809.8   | 0.9702 | 0.9699 | 0.9703 | 0.9702
RFID | RNN   | Grid   | 0.005  | 0.008  | 6  | 96  | 81     | 2846.1   | 0.9342 | 0.9388 | 0.9201 | 0.9252
RFID | RNN   | Random | 0.005  | 0.012  | 4  | 32  | 9      | 642.3    | 0.8891 | 0.9138 | 0.8826 | 0.8895
RFID | RNN   | BO     | 0.0142 | 0.0093 | 6  | 79  | 9      | 452.2    | 0.9071 | 0.8556 | 0.8486 | 0.8436
RFID | RNN   | Ours   | 0.005  | 0.008  | 6  | 64  | 9      | 497.1    | 0.9134 | 0.9138 | 0.9029 | 0.9162
RFID | CNN   | Grid   | 0.005  | 4      | 2  | 192 | 81     | 7890.8   | 0.9316 | 0.9513 | 0.9316 | 0.9375
RFID | CNN   | Random | 0.005  | 2      | 1  | 128 | 9      | 1210.3   | 0.8683 | 0.9113 | 0.8684 | 0.8779
RFID | CNN   | BO     | 0.005  | 5      | 3  | 64  | 9      | 872.9    | 0.9168 | 0.9058 | 0.9194 | 0.9086
RFID | CNN   | Ours   | 0.005  | 4      | 3  | 192 | 9      | 980.3    | 0.9235 | 0.9316 | 0.9188 | 0.9326

Step 1: Build the F-L table According to the description in Section 4.2, OATM works on four different hyper-parameters (factors): the learning rate lr, the l2-norm coefficient λ, the number of hidden layers n_l, and the number of nodes n_n. The number of levels h is set to 3, though it could be much larger in real-world applications. Based on related work and tuning experience [14], the empirical values are shown in Table 3.
Step 2: OATM table Then, we choose a suitable Orthogonal Array table containing 4 factors with 3 levels each for our experiments from https://www.york.ac.uk/depts/maths/tables/taguchi_table.htm; it contains 9 combinations. The OATM table should satisfy the two basic principles: i) in each column, different levels appear the same number of times; ii) in any two randomly selected columns, the nine differently ordered level combinations are complete and balanced.
Step 3: Run the experiments Following the OATM table, we run the 9 experiments and record the classification accuracy. In our case, each experiment is run 5 times and the corresponding average accuracy is recorded. Each experiment is trained for 1,000 iterations to guarantee convergence.
Step 4: Range analysis This is the key step of Orthogonal Array Tuning. The overall range analysis procedure and results are shown in Table 4. The first 9 rows are measured and recorded in Step 3. R_level_i denotes the sum of the accuracies under level i. For example, R_level1 of factor 1 is the sum of the accuracies of the first 3 rows (2.196 = 0.875 + 0.8 + 0.521), where factor 1 is at level 1. A_level_i denotes the average accuracy of level i, calculated as A_level_i = R_level_i / h. In the above example, A_level1 = 2.196 / 3 = 0.732. The lowest and highest accuracy values, i.e., the minimum and maximum of A_level_i respectively, are used to calculate the range of A_level_i. The importance denotes how important a factor is and is ranked by the range value. The best level is the optimal level selected based on the highest accuracy, and the optimal value is the corresponding hyper-parameter value of the best level.

Table 3: Factor-Level table of RNN and CNN.

RNN     | Factor 1 (lr)  | Factor 2 (λ)  | Factor 3 (n_l)  | Factor 4 (n_n)
Level 1 | 0.005          | 0.004         | 4               | 32
Level 2 | 0.01           | 0.008         | 5               | 64
Level 3 | 0.015          | 0.012         | 6               | 96

CNN     | Factor 1 (lr') | Factor 2 (f') | Factor 3 (n_l') | Factor 4 (n_n')
Level 1 | 0.001          | [1,2]         | 1               | 64
Level 2 | 0.003          | [1,4]         | 2               | 128
Level 3 | 0.005          | [1,6]         | 3               | 192

Table 4: Range analysis of RNN

Row No.       | Factor 1 (lr) | Factor 2 (λ) | Factor 3 (n_l) | Factor 4 (n_n) | Acc
1             | 0.005         | 0.004        | 4              | 32             | 0.875
2             | 0.005         | 0.008        | 5              | 64             | 0.8
3             | 0.005         | 0.012        | 6              | 96             | 0.521
4             | 0.01          | 0.004        | 5              | 96             | 0.888
5             | 0.01          | 0.008        | 6              | 32             | 0.797
6             | 0.01          | 0.012        | 4              | 64             | 0.451
7             | 0.015         | 0.004        | 6              | 64             | 0.897
8             | 0.015         | 0.008        | 4              | 96             | 0.335
9             | 0.015         | 0.012        | 5              | 32             | 0.471
R_level1      | 2.196         | 2.66         | 1.661          | 2.143          |
R_level2      | 2.136         | 1.932        | 2.159          | 2.148          |
R_level3      | 1.703         | 1.443        | 2.215          | 1.744          |
A_level1      | 0.732         | 0.887        | 0.554          | 0.714          |
A_level2      | 0.712         | 0.644        | 0.720          | 0.716          |
A_level3      | 0.568         | 0.481        | 0.738          | 0.581          |
Lowest Acc    | 0.568         | 0.481        | 0.554          | 0.581          |
Highest Acc   | 0.732         | 0.887        | 0.738          | 0.716          |
Range         | 0.164         | 0.406        | 0.184          | 0.135          |
Importance    | λ > n_l > lr > n_n
Best Level    | Level 1       | Level 1      | Level 3        | Level 2        |
Optimal Value | 0.005         | 0.004        | 6              | 64             | 0.925
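
The range analysis of Table 4 can be reproduced mechanically. The short Python snippet below recomputes R_level, A_level, the range, and the resulting importance ranking from the nine level assignments and accuracies recorded in Table 4.

# Level indices (per Table 3) and accuracies of the nine runs in Table 4.
runs = [  # (lr level, lambda level, n_l level, n_n level), accuracy
    ((1, 1, 1, 1), 0.875), ((1, 2, 2, 2), 0.800), ((1, 3, 3, 3), 0.521),
    ((2, 1, 2, 3), 0.888), ((2, 2, 3, 1), 0.797), ((2, 3, 1, 2), 0.451),
    ((3, 1, 3, 2), 0.897), ((3, 2, 1, 3), 0.335), ((3, 3, 2, 1), 0.471),
]
factors = ["lr", "lambda", "n_l", "n_n"]

analysis = {}
for j, name in enumerate(factors):
    # R_level_i: sum of accuracies where factor j is at level i; A_level_i: its mean.
    R = {i: sum(acc for lv, acc in runs if lv[j] == i) for i in (1, 2, 3)}
    A = {i: R[i] / 3 for i in (1, 2, 3)}
    analysis[name] = {"best_level": max(A, key=A.get),
                      "range": max(A.values()) - min(A.values())}

# Importance ranking (largest range first): lambda > n_l > lr > n_n.
print(sorted(analysis, key=lambda f: analysis[f]["range"], reverse=True))
# Best level per factor: lr -> 1, lambda -> 1, n_l -> 3, n_n -> 2.
print({f: analysis[f]["best_level"] for f in factors})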

Step 5: Run the optimal setting Since the best levels are given by the range analysis in the previous step, we run the experiment with the optimal hyper-parameters (lr = 0.005, λ = 0.004, n_l = 6, and n_n = 64) and finally obtain the optimal accuracy of 0.925. We can observe that:

Table 5: Range analysis of CNN

Row No.       | Factor 1 (lr') | Factor 2 (f') | Factor 3 (n_l') | Factor 4 (n_n') | Acc
1             | 0.001          | [1,2]         | 1               | 64              | 0.707
2             | 0.001          | [1,4]         | 2               | 128             | 0.771
3             | 0.001          | [1,6]         | 3               | 192             | 0.775
4             | 0.003          | [1,2]         | 2               | 192             | 0.779
5             | 0.003          | [1,4]         | 3               | 64              | 0.752
6             | 0.003          | [1,6]         | 1               | 128             | 0.797
7             | 0.005          | [1,2]         | 3               | 128             | 0.784
8             | 0.005          | [1,4]         | 1               | 192             | 0.782
9             | 0.005          | [1,6]         | 2               | 64              | 0.756
R_level1      | 2.253          | 2.27          | 2.993           | 2.215           |
R_level2      | 2.328          | 2.305         | 2.306           | 2.352           |
R_level3      | 2.322          | 2.328         | 2.311           | 2.336           |
A_level1      | 0.751          | 0.757         | 0.998           | 0.738           |
A_level2      | 0.776          | 0.768         | 0.769           | 0.784           |
A_level3      | 0.774          | 0.776         | 0.770           | 0.779           |
Lowest Acc    | 0.751          | 0.757         | 0.769           | 0.738           |
Highest Acc   | 0.776          | 0.776         | 0.998           | 0.784           |
Range         | 0.025          | 0.019         | 0.229           | 0.046           |
Importance    | n_l' > n_n' > lr' > f'
Best Level    | Level 2        | Level 3       | Level 1         | Level 2         |
Optimal Value | 0.003          | [1,6]         | 1               | 128             | 0.797

– The optimal accuracy of 0.925 is higher than the maximum accuracy (0.897) obtained in the OATM experiments, which demonstrates that OATM is able to approximate the global optimum instead of a local optimum.
– The importance of each factor is ranked through the range analysis: λ > n_l > lr > n_n, which can guide researchers to grasp the dominating variables of the RNN structure and be helpful in future development.

The OATM paradigm for CNN is similar to that for RNN. Here, we only report the F-L table (Table 3) and the range analysis table (Table 5).

6 Discussion and Conclusion

In this paper, we present an efficient OATM approach for hyper-parameter tuning in the context of deep learning. The proposed OATM is evaluated over two popular deep learning structures (RNN and CNN) on three real-world datasets. The experimental results show that our approach outperforms state-of-the-art hyper-parameter tuning methods under the same tuning budget and saves substantial tuning time compared to grid search while preserving satisfying performance.
One disadvantage of OATM is that it requires empirical values as prerequisites: the values of the F-L table should be chosen appropriately. However, this is a common drawback of all the tuning methods; for instance, the hyper-parameter ranges in Bayesian Optimization are also pre-defined based on empirical values.

References
1. Andradóttir, S.: A review of random search methods. In: Handbook of Simulation Optimization, pp. 277–292. Springer (2015)
2. Bergstra, J.S., Bardenet, R., Bengio, Y., Kégl, B.: Algorithms for hyper-parameter optimization. In: Advances in Neural Information Processing Systems 24, pp. 2546–2554 (2011)
3. Calandra, R., Gopalan, N., Seyfarth, A., Peters, J., Deisenroth, M.P.: Bayesian gait optimization for bipedal locomotion. In: Learning and Intelligent Optimization, pp. 274–290 (2014)
4. Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., Hutter, F.: Efficient and robust automated machine learning. In: Advances in Neural Information Processing Systems 28 (2015)
5. Fida, B., Bibbo, D., Bernabucci, I., et al.: Real time event-based segmentation to classify locomotion activities through a single inertial sensor. In: Proceedings of the 5th EAI International Conference on Wireless Mobile Communication and Healthcare, pp. 104–107 (2015)
6. Gu, J., Wang, Z., Kuen, J., Ma, L., Shahroudy, A., Shuai, B., Liu, T., Wang, X., Wang, G., Cai, J., et al.: Recent advances in convolutional neural networks. Pattern Recognition 77, 354–377 (2018)
7. Li, K., Xu, H., Wang, Y., Povey, D., Khudanpur, S.: Recurrent neural network language model adaptation for conversational speech recognition. INTERSPEECH, Hyderabad, pp. 1–5 (2018)
8. Mahapatra, S., Patnaik, A.: Optimization of wire electrical discharge machining (WEDM) process parameters using Taguchi method. The International Journal of Advanced Manufacturing Technology 34(9), 911–925 (2007)
9. Nalbant, M., Gökkaya, H., Sur, G.: Application of Taguchi method in the optimization of cutting parameters for surface roughness in turning. Materials & Design 28(4), 1379–1385 (2007)
10. Snoek, J., Larochelle, H., Adams, R.P.: Practical Bayesian optimization of machine learning algorithms. In: Advances in Neural Information Processing Systems 25, pp. 2951–2959. Curran Associates, Inc. (2012)
11. Swersky, K., Snoek, J., Adams, R.P.: Multi-task Bayesian optimization. In: Advances in Neural Information Processing Systems 26, pp. 2004–2012 (2013)
12. Taguchi, G.: System of experimental design: engineering methods to optimize quality and minimize costs. Tech. rep. (1987)
13. Yao, L., Sheng, Q.Z., Li, X., Gu, T., Tan, M., Wang, X., Wang, S., Ruan, W.: Compressive representation for device-free activity recognition with passive RFID signal strength. IEEE Transactions on Mobile Computing 17(2), 293–306 (2017)
14. Zhang, X., Yao, L., Huang, C., Sheng, Q.Z., Wang, X.: Intent recognition in smart living through deep recurrent neural networks. In: International Conference on Neural Information Processing (ICONIP), pp. 748–758. Springer (2017)
15. Zhang, X., Yao, L., Sheng, Q.Z., Kanhere, S.S., Gu, T., Zhang, D.: Converting your thoughts to texts: Enabling brain typing via deep feature learning of EEG signals (2018)
