Abstract
Although weight and activation quantization is an effective approach for Deep Neural Network (DNN) compression and has considerable potential to increase inference speed by leveraging bit operations, there is still a noticeable gap in terms of prediction accuracy between the quantized model and the full-precision model. To address this gap, we propose to jointly train a quantized, bit-operation-compatible DNN and its associated quantizers, as opposed to using fixed, handcrafted quantization schemes such as uniform or logarithmic quantization. Our method for learning the quantizers applies to both network weights and activations with arbitrary-bit precision, and our quantizers are easy to train. Comprehensive experiments on the CIFAR-10 and ImageNet datasets show that our method works consistently well for various network structures such as AlexNet, VGG-Net, GoogLeNet, ResNet, and DenseNet, surpassing previous quantization methods in terms of accuracy by an appreciable margin. Code is available at https://github.com/Microsoft/LQ-Nets.
D. Zhang, J. Yang and D. Ye—Contributed equally. This work was done when DY was an intern at MSR.
1 Introduction
Deep neural networks, especially the deep convolutional neural networks, have achieved tremendous success in computer vision and the broader artificial intelligence field. However, the large model size and high computation cost remain great hurdles for many applications, especially on some constrained devices with limited memory and computational resources.
To address this issue, there has been a recent surge of interest in reducing the model complexity of DNNs. Representative techniques include quantization [3, 6, 9, 18, 21, 22, 29, 34, 39, 52, 53, 54, 55], pruning [12, 13, 17, 36], low-rank decomposition [7, 8, 24, 27, 38, 49, 51], hashing [4], and deliberate architecture design [19, 23, 50]. Among these approaches, quantization-based methods represent the network weights with very low precision, thus yielding highly compact DNN models compared to their floating-point counterparts. Moreover, it has been shown that if both the network weights and activations are properly quantized, the convolution operations can be efficiently computed via bitwise operations [21, 39], enabling fast inference without a GPU.
Notwithstanding the promising results achieved by the existing quantization-based methods [3, 6, 9, 18, 21, 22, 29, 34, 39, 52, 53, 54, 55], there is still a sizeable accuracy gap between the quantized DNNs and their full-precision counterparts, especially when quantized with extremely low bit-widths such as 1 bit or 2 bits. For example, using the state-of-the-art method of [3], a 50-layer ResNet model [15] with 1-bit weights and 2-bit activations can achieve 64.6% top-1 image classification accuracy on the ImageNet validation set [40]. However, the full-precision reference is 75.3% [15], i.e., the absolute accuracy drop induced by quantization is as large as 10.7%.
This work is devoted to pushing the limit of network quantization algorithms to achieve better accuracy with low precision weights and activations. We found that existing methods often use simple, hand-crafted quantizers (e.g., uniform or logarithmic quantization) [11, 22, 31, 37, 52, 53] or otherwise pre-computed quantizers fixed during network training [3]. However, one can never be sure that the simple quantizers are the best choices for network quantization. Moreover, the distributions of weights and activations in different networks and even different network layers may differ a lot. We believe a better quantizer should be made adaptive to the weights and activations to gain more flexibility.
To this end, we propose to jointly train a quantized DNN and its associated quantizers. The proposed method not only makes the quantizers learnable, but also renders them compatible with bitwise operations so as to keep the fast inference merit of properly-quantized neural networks. Our quantizer can be optimized via backpropagation in a standard network training pipeline, and we further propose an algorithm based on quantization error minimization which yields better performance. The proposed quantization can be applied to both network weights and activations, and arbitrary bit-width can be achieved. Moreover, layer-wise quantizers with unshared parameters can be applied to gain further flexibility. We call the networks quantized by our method the “LQ-Nets”.
We evaluate our LQ-Nets with image classification tasks on the CIFAR-10 [25] and ImageNet [40] datasets. The experimental results show that they perform remarkably well across various network structures such as AlexNet [26], VGG-Net [41], GoogLeNet [42], ResNet [15] and DenseNet [20], surpassing previous quantization methods by a wide margin.
2 Related Work
A large number of works have been devoted to reducing DNN model size and improving inference efficiency for practical applications. We briefly review the existing approaches as follows.
Compact Network Design: To achieve fast inference, one strategy is to carefully design a compact network architecture [19, 23, 32, 42, 50]. For example, Network in Network [32] enhanced the local modeling via the micro networks and replaced the costly fully-connected layer by global average pooling. GoogLeNet [42] and SqueezeNet [23] utilized \(1\!\times \!1\) convolution layers to compute reductions before the expensive \(3\!\times \!3\) or \(5\!\times \!5\) convolutions. Similarly, ResNet [15] applied “bottleneck” structures with \(1\!\times \!1\) convolutions when training deeper nets with an enormous number of parameters. The recently proposed computation-efficient network structures MobileNet [19] and ShuffleNet [50] employed depth-wise convolution or group convolution advocated in [5, 48] to reduce the computation cost.
Network Parameter Reduction: Considerable efforts have been devoted to reducing the number of parameters in an existing network [4, 7, 8, 10, 12, 13, 17, 24, 27, 28, 35, 36, 38, 45, 46, 49, 51]. For example, by exploiting the redundancy of the filter weights, some methods substitute the pre-trained weights using their low-rank approximations [7, 8, 24, 27, 38, 49, 51]. Connection pruning was investigated in [12, 13] to reduce the parameters of AlexNet and VGG-Net, where significant reduction was achieved on fully-connected layers. Promising results on modern network architectures such as ResNet were achieved recently by [17, 36]. Another similar technique is to regularize the network by structured sparsity to obtain a hardware-friendly DNN model [28, 35, 45]. Some other approaches such as hashing and vector quantization [44] have also been explored to reduce DNN model complexity [4, 10, 46].
Network Quantization: Another category of existing methods, which our method also belongs to, train low-precision DNNs via quantization. These methods can be further divided into two subcategories: those performing quantization on weights only versus both weights and activations.
For weight-only quantization methods, Courbariaux et al. [6] constrained the weights to only two possible values of \(-1\) and 1 (i.e., binarization or one-bit quantization). They obtained promising results on small datasets using stochastic binarization. Rastegari et al. [39] later demonstrated that deterministic binarization with optimized scale factors to approximate the full-precision weights works better on deeper network structures and larger datasets. To obtain better accuracy, ternary and other multi-bit quantization schemes were explored in [9, 18, 29, 34, 52, 54]. It was shown in [52] that quantizing a network with five bits can achieve similar accuracy to its 32-bit floating-point counterpart by incremental group-wise quantization and re-training.
In the latter regard, Hubara et al. [21] and Rastegari et al. [39] proposed to binarize both weights and activations to \(-1\) and \(+1\). This way, the convolution operations can be implemented by efficient bit-wise operations for substantial speed-up. To address the significant accuracy drop, multi-bit quantization was further studied in [22, 30, 33, 37, 43, 53]. A popular choice of the quantization function is the uniform quantization [22, 53]. Miyashita et al. [37] used logarithmic quantization and improved the inference efficiency via the bitshift operation. Cai et al. [3] proposed to binarize the network weights while quantizing the activations using multiple bits. A single activation quantizer computed by fitting the probability density function of a half-wave Gaussian distribution is applied to all network layers and fixed during training. In the multi-bit quantization methods of Tang et al. [43] and Li et al. [30], each bit is used to binarize the residual approximation error from previous bits.
Our proposed method can quantize both the weights and the activations with arbitrary bit-widths. Different from most of the previous methods, our quantizer is adaptively learned during network training.
3 LQ-Nets: Networks with Learned Quantization
In this section, we first briefly introduce the goal of neural network quantization. Then we present the details of our quantization method and how to train a quantized DNN model with it in a standard network training pipeline.
3.1 Preliminaries: Network Quantization
The main operations in deep neural networks are interleaved linear and non-linear transformations, expressed as
$$\begin{aligned} z=\sigma (\mathbf{w}^{{\mathrm {T}}}\mathbf{a}), \end{aligned}$$(1)
where \(\mathbf{w}\in \mathbb {R}^{N}\) is the weight vector, \(\mathbf{a}\in \mathbb {R}^{N}\) is the input activation vector computed by the previous network layer, \(\sigma (\,\cdot \,)\) is a non-linear function, and z is the output activation (see Note 1). The convolutional layers are composed of multiple convolution filters \(\mathbf{w}_i\in \mathbb {R}^{C\cdot H\cdot W}\), where C, H and W are the number of convolution filter channels, kernel height, and kernel width, respectively. Fully-connected layers can be viewed as a special type of convolutional layer. Modern deep neural networks often have millions of weight parameters, which incur large memory footprints. Meanwhile, the large numbers of inner product operations between the weights and feature vectors lead to high computation cost. The memory and computation costs are great hurdles for many applications on resource-constrained devices such as mobile phones.
The goal of network quantization is to represent the floating-point weights \(\mathbf{w}\) and/or activations \(\mathbf{a}\) with few bits. In general, a quantization function is a piecewise-constant function which can be written as
$$\begin{aligned} Q(x)=q_{l}, \quad \text {if}\ x\in (t_{l},t_{l+1}], \end{aligned}$$(2)
where \(q_{l}\), \(l=1,...,L\) are the quantization levels and \((t_{l},t_{l+1}]\) are quantization intervals. The quantization function maps all the input values within a quantization interval to the corresponding quantization level, and a quantized value can be encoded by only \(\text {log}_{2}L\) bits. Perhaps the simplest quantizer is the sign function used for binary quantization [21, 39]: \(Q(x)= +1\) if \(x\ge 0\) or \(-1\) otherwise. For quantization with 2 or more bits, the most commonly used quantizer is the uniform quantization function where all the quantization steps \(q_{l+1}-q_l\) are equal [22, 53]. Some methods use logarithmic quantization which uniformly quantizes the data in the log-domain [37].
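The piecewise-constant mapping just described can be sketched in a few lines of Python. This is only an illustration: the helper names `quantize` and `uniform_quantizer` are hypothetical, and placing the thresholds of the uniform quantizer at the midpoints between neighboring levels (nearest-level rounding) is an assumption made for the example.

```python
from bisect import bisect_left

def quantize(x, levels, thresholds):
    """Map x to levels[l] where x lies in (thresholds[l-1], thresholds[l]].

    `levels` has L entries; `thresholds` holds the L-1 finite interior
    boundaries (the two outer boundaries are -inf and +inf).
    """
    # bisect_left counts the thresholds strictly below x, so a value equal
    # to a threshold falls into the interval on its left (closed on the right).
    return levels[bisect_left(thresholds, x)]

def uniform_quantizer(lo, hi, bits):
    """Equally spaced levels on [lo, hi] with midpoint thresholds (assumed)."""
    n = 2 ** bits
    step = (hi - lo) / (n - 1)
    levels = [lo + i * step for i in range(n)]
    thresholds = [(a + b) / 2 for a, b in zip(levels, levels[1:])]
    return levels, thresholds

levels, thresholds = uniform_quantizer(0.0, 3.0, bits=2)
samples = [-1.0, 0.5, 0.6, 2.9, 10.0]
quantized = [quantize(x, levels, thresholds) for x in samples]
print(levels)      # [0.0, 1.0, 2.0, 3.0]
print(thresholds)  # [0.5, 1.5, 2.5]
print(quantized)   # [0.0, 0.0, 1.0, 3.0, 3.0]
```

With L = 4 levels each quantized value is identified by an index in 0..3, i.e., by 2 bits.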
Quantizing the network weights can generate highly compact and memory-efficient DNN models: using n-bit encoding, the compression rate is \(\frac{32}{n}\) or \(\frac{64}{n}\) compared to the 32-bit or 64-bit floating point representation. Moreover, if both weights and activations are quantized properly, the inner product in Eq. (1) can be computed by bitwise operations such as xnor and popcnt, where xnor is the exclusive-not-or logical operation and popcnt counts the number of 1’s in a bit string. Both operations can process at least 64 bits in one or a few clock cycles on most general computing platforms such as CPUs and GPUs, which potentially leads to a 64\(\times \) speedup (see Note 2).
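As a minimal sketch of the xnor/popcnt idea for the 1-bit case, the following Python snippet packs two \(\{-1,+1\}\) vectors into integers and recovers their inner product from the number of agreeing bits. The function names are hypothetical, and Python integers stand in for the 64-bit machine words that a real implementation would use.

```python
def pack_bits(signs):
    """Pack a list of +1/-1 values into an int: bit i is 1 iff signs[i] == +1."""
    word = 0
    for i, s in enumerate(signs):
        if s == 1:
            word |= 1 << i
    return word

def binary_dot(a_word, b_word, n):
    """Inner product of two length-n {-1,+1} vectors from their packed bits."""
    mask = (1 << n) - 1
    xnor = ~(a_word ^ b_word) & mask   # bit is 1 where the two signs agree
    agree = bin(xnor).count("1")       # popcount
    return 2 * agree - n               # agreements minus disagreements

a = [1, -1, -1, 1, 1, -1, 1, 1]
b = [1, 1, -1, -1, 1, -1, -1, 1]
fast = binary_dot(pack_bits(a), pack_bits(b), len(a))
slow = sum(x * y for x, y in zip(a, b))
print(fast, slow)  # 2 2
```

Each agreeing position contributes \(+1\) and each disagreeing position \(-1\), which is why the result is twice the popcount minus the vector length.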
3.2 Learnable Quantizers
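An optimal quantizer should yield minimal quantization error for the input data distribution:
$$\begin{aligned} Q^{*}(x)=\arg \min _{Q}\int p(x)\,\big (Q(x)-x\big )^{2}\,\mathrm {d}x, \end{aligned}$$(3)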
where p(x) is the probability density function of x. We can never be sure if the popular quantizers such as a uniform quantizer are the optimal selections for the network weights and activations. In Fig. 1 we present the statistical distributions of the weights and activations (after batch normalization (BN) and Rectified Linear Unit (ReLU) layers) in a trained floating-point network. It can be seen that the distributions can be complex and differ across layers, and a uniform quantizer is not optimal for them. Of course, if we train a quantized network the weight and activation distributions may change. But again we can never be sure if any pre-defined quantizer is optimal for our task, and an improper quantizer can easily jeopardize the final accuracy.
To get better network quantizers and improve the accuracy of a quantized network, we propose to jointly train the network and its quantizers. The insight is that if the quantizers are learnable and optimized through network training, they can not only minimize the quantization error, but also adapt to the training goal, thus improving the final accuracy. A naive way to train the quantizers would be to directly optimize the quantization levels \(\{q_l\}\) in network training. However, such a naive strategy would render the quantization functions incompatible with bitwise operations, which is undesirable as we want to keep the fast inference merit of quantized neural networks.
To resolve this issue, we need to confine our quantization functions into a subspace which is compatible with bitwise operations. But how to confine the quantizers into such a space during training? Our solution is inspired by the uniform quantization which is bit-op compatible (see [53]). The uniform quantization essentially maps floating-point numbers to their nearest fixed-point integers with a normalization factor, and the key property for it to be bit-op-compatible is that the quantized values can be decomposed into a linear combination of the bits. Specifically, an integer q represented by a K-bit binary encoding is actually the inner product between a basis vector and the binary coding vector \(\mathbf {b}=[b_1,b_2,...,b_K]^{{\mathrm {T}}}\) where \(b_i\in \{0,1\}\), i.e.,
$$\begin{aligned} q=\left\langle [1,2,\ldots ,2^{K-1}]^{{\mathrm {T}}},\ [b_1,b_2,\ldots ,b_K]^{{\mathrm {T}}}\right\rangle . \end{aligned}$$(4)
In order to learn the quantizers while keeping them compatible with bitwise operations, we can simply learn the basis vector which consists of K scalars.
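To make the bit decomposition concrete, the sketch below writes an integer as the inner product of the power-of-two basis with its bit vector, and then swaps in a different basis. The function names and the "learned" basis values are hypothetical and chosen only for illustration.

```python
def to_bits(q, K):
    """K-bit encoding [b_1, ..., b_K] of integer q, least significant bit first."""
    return [(q >> i) & 1 for i in range(K)]

def inner(basis, bits):
    """Inner product between a basis vector and a binary coding vector."""
    return sum(v * b for v, b in zip(basis, bits))

K = 3
fixed_basis = [2 ** i for i in range(K)]   # [1, 2, 4]: the uniform case
bits = to_bits(5, K)                       # [1, 0, 1]
print(inner(fixed_basis, bits))            # 5

# Replacing the fixed basis by K free scalars (hypothetical values) changes
# the quantization levels while the binary codes stay the same.
learned_basis = [0.5, 1.5, 4.0]
levels = sorted(inner(learned_basis, to_bits(q, K)) for q in range(2 ** K))
print(levels)  # [0.0, 0.5, 1.5, 2.0, 4.0, 4.5, 5.5, 6.0]
```

The levels are no longer equally spaced, yet every level is still a linear combination of K bits, which is the property that bitwise inner products rely on.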
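Concretely, our learnable quantization function takes the form
$$\begin{aligned} Q_{\text {ours}}(x,\mathbf {v})=\mathbf {v}^{{\mathrm {T}}}\mathbf {e}_{l}, \quad \text {if}\ x\in (t_{l},t_{l+1}], \end{aligned}$$(5)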
where \(\mathbf {v}\in \mathbb {R}^{K}\) is the learnable floating-point basis and \(\mathbf {e}_{l}\in \{-1,1\}^{K}\) for \(l=1,\ldots ,2^{K}\) enumerates all the K-bit binary encodings from \([-1,\ldots ,-1]\) to \([1,\ldots ,1]\) (see Note 3). For a K-bit quantization, the \(2^{K}\) quantization levels are generated by \(q_{l}=\mathbf {v}^{{\mathrm {T}}}\mathbf {e}_{l}\) for \(l=1,\ldots ,2^K\). Given \(\{q_l\}\) and assuming \(q_{1}<q_{2}<...<q_{2^K}\), it can be easily derived that for any x, the optimal \(\{t_l\}\) minimizing the error in Eq. (3) are simply \(t_{l}=(q_{l-1}+q_{l})/2\) for \(l=2,...,2^K\) (note \(t_1=-\infty \) and \(t_{2^K+1}=+\infty \)). Figure 2 illustrates our quantizer with the 2-bit and 3-bit cases.
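The construction of levels and thresholds from a basis can be sketched as follows. This is an illustrative Python version, not the released TensorFlow code; `build_quantizer` and `encode` are hypothetical names, and the basis values are made up for the example.

```python
from bisect import bisect_left
from itertools import product

def build_quantizer(v):
    """Return (levels, codes, thresholds) for basis v, sorted by level."""
    pairs = sorted(
        (sum(vi * ei for vi, ei in zip(v, e)), e)
        for e in product((-1, 1), repeat=len(v))   # all 2^K codes in {-1,1}^K
    )
    levels = [q for q, _ in pairs]
    codes = [e for _, e in pairs]
    # Interior thresholds are the midpoints between neighboring levels.
    thresholds = [(a + b) / 2 for a, b in zip(levels, levels[1:])]
    return levels, codes, thresholds

def encode(x, v):
    """Quantize scalar x: return its level and the K-bit code of that level."""
    levels, codes, thresholds = build_quantizer(v)
    l = bisect_left(thresholds, x)
    return levels[l], codes[l]

v = [1.0, 0.25]                      # hypothetical 2-bit basis
levels, codes, thresholds = build_quantizer(v)
print(levels)          # [-1.25, -0.75, 0.75, 1.25]
print(thresholds)      # [-1.0, 0.0, 1.0]
print(encode(0.3, v))  # (0.75, (1, -1))
```

With `v = [1.0, 2.0]` the same code yields the equally spaced levels -3, -1, 1, 3, so the uniform quantizer is one point in the space of quantizers that can be represented this way.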
We now show how the inner products between our quantized weights and activations can be computed by bitwise operations. Let a weight vector \(\mathbf{w}\in \mathbb {R}^N\) be encoded by the vectors \(\mathbf{b}^w_{i}\in \{-1,1\}^N\), \({i}=1,\ldots ,K_w\) where \(K_w\) is the bit-width for weights and \(\mathbf{b}^w_{i}\) consists of the encoding of the i-th bit for all the values in \(\mathbf{w}\). Similarly, for activation vector \(\mathbf{a}\in \mathbb {R}^N\) we have \(\mathbf{b}^a_{j}\in \{-1,1\}^N\), \({j}=1,\ldots ,K_a\). It can be readily derived that
$$\begin{aligned} Q_{\text {ours}}(\mathbf{w},\mathbf{v}^w)^{{\mathrm {T}}}Q_{\text {ours}}(\mathbf{a},\mathbf{v}^a)=\sum _{i=1}^{K_w}\sum _{j=1}^{K_a}v^w_{i}v^a_{j}\,(\mathbf{b}^w_{i}\odot \mathbf{b}^a_{j}), \end{aligned}$$(6)
where \(\mathbf{v}^w\in \mathbb {R}^{K_w}\) and \(\mathbf{v}^a\in \mathbb {R}^{K_a}\) are the learned basis vectors for the weight and activation quantizers respectively, and \(\odot \) denotes the inner product with bitwise operations xnor and popcnt.
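The decomposition into per-bit inner products can be checked numerically with a small Python sketch. All names and values are hypothetical; both operands use the \(\{-1,1\}\) encoding adopted in the text, and Python integers stand in for packed machine words.

```python
def pack(signs):
    """Pack a {-1,+1} bit-plane into an int (bit i set iff signs[i] == +1)."""
    return sum(1 << i for i, s in enumerate(signs) if s == 1)

def xnor_popcount_dot(x, y, n):
    """{-1,+1} inner product of two packed length-n bit-planes."""
    agree = bin(~(x ^ y) & ((1 << n) - 1)).count("1")
    return 2 * agree - n

def quantized_dot(v_w, planes_w, v_a, planes_a):
    """Sum over all (i, j) bit-plane pairs, weighted by the basis entries."""
    n = len(planes_w[0])
    packed_w = [pack(p) for p in planes_w]
    packed_a = [pack(p) for p in planes_a]
    return sum(
        vw * va * xnor_popcount_dot(pw, pa, n)
        for vw, pw in zip(v_w, packed_w)
        for va, pa in zip(v_a, packed_a)
    )

# 1-bit weights and 2-bit activations over N = 4 elements (made-up values).
v_w, planes_w = [0.5], [[1, -1, 1, 1]]
v_a, planes_a = [1.0, 0.25], [[1, 1, -1, 1], [-1, 1, 1, -1]]

# Reference: rebuild the quantized vectors and take an ordinary dot product.
q_w = [sum(v * p[n] for v, p in zip(v_w, planes_w)) for n in range(4)]
q_a = [sum(v * p[n] for v, p in zip(v_a, planes_a)) for n in range(4)]
reference = sum(x * y for x, y in zip(q_w, q_a))

bitwise = quantized_dot(v_w, planes_w, v_a, planes_a)
print(bitwise, reference)  # -0.25 -0.25
```

Only \(K_w K_a\) bitwise inner products and as many floating-point multiply-adds are needed, independent of how the basis values were learned.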
In practice, we apply layer-wise quantizers for activations (i.e., one quantizer per layer) and channel-wise quantizers for weights (one quantizer for each conv filter). The number of extra parameters introduced by the quantizers is negligible compared to the large volume of network weights.
3.3 Training Algorithm
To train the LQ-Nets, we use floating-point network weights which are quantized before convolution and optimized with error back-propagation (BP) and gradient descent. After training, the floating-point weights can be discarded; only their binary codes and the quantizer bases are kept. We now present how we optimize the quantizers.
Quantizer Optimization: A simple and straightforward way to optimize our quantizers is through BP similar to weight optimization. Here we present an algorithm based on quantization error minimization which optimizes our quantizers in the forward passes during training. This algorithm leads to much better performance as we will show later in the experiments.
Let \(\mathbf {x}=[x_{1},...,x_{N}]^{{\mathrm {T}}}\in \mathbb {R}^{N}\) be the full-precision data (weights or activations) and K be the specified bit number for quantization. Our goal is to find an optimal quantizer basis \(\mathbf{v}\in \mathbb {R}^K\) as well as an encoding \(B=[\mathbf {b}_{1},...,\mathbf {b}_{N}]\in \{-1,1\}^{K\times N}\) that minimize the quantization error:
$$\begin{aligned} \mathbf {v}^*,B^*=\arg \min _{\mathbf {v},B}\left\| B^{{\mathrm {T}}}\mathbf {v}-\mathbf {x}\right\| _2^2, \quad \text {s.t.}\ B\in \{-1,1\}^{K\times N}. \end{aligned}$$(7)
Equation (7) is hard to solve exactly: a brute-force search for the provably optimal solution takes time exponential in the size of B. For efficiency, we alternately solve for \(\mathbf{v}\) and B in a block coordinate descent fashion:
- Fix \(\mathbf {v}\) and update B. Given \(\mathbf {v}\), the optimal encoding \(B^*\) can be simply found by looking up the quantization intervals \(t_{1},...,t_{2^K+1}\).
- Fix B and update \(\mathbf {v}\). Given B, Eq. (7) reduces to a linear regression problem with a closed form solution as
$$\begin{aligned} \mathbf {v}^*=(BB^{{\mathrm {T}}})^{-1}B\mathbf {x}. \end{aligned}$$(8)
We iterate the alternation T times. For brevity, we will refer to the above procedure as the QEM (Quantization Error Minimization) algorithm.
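The alternation can be sketched in dependency-free Python as below. This is an illustrative sketch under stated assumptions rather than the released implementation: all function names are hypothetical, B is stored as a list of N codes (its columns), the nearest-level search stands in for the interval lookup, and the rows of B are assumed to be linearly independent so that \(BB^{\mathrm{T}}\) is invertible. A real implementation would use a linear algebra library instead of the small hand-written solver.

```python
from itertools import product

def solve(A, y):
    """Solve the small K x K system A v = y by Gauss-Jordan elimination."""
    k = len(y)
    M = [list(row) + [rhs] for row, rhs in zip(A, y)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(M[r][c]))   # partial pivoting
        M[c], M[p] = M[p], M[c]
        for r in range(k):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [M[i][k] / M[i][i] for i in range(k)]

def level(v, e):
    return sum(vi * ei for vi, ei in zip(v, e))

def update_codes(x, v):
    """Fix v, update B: assign each x_n the code of its nearest level."""
    codes = list(product((-1, 1), repeat=len(v)))
    return [min(codes, key=lambda e: abs(level(v, e) - xn)) for xn in x]

def update_basis(x, B):
    """Fix B, update v: least-squares solution v = (B B^T)^{-1} B x."""
    k = len(B[0])
    BBt = [[sum(b[i] * b[j] for b in B) for j in range(k)] for i in range(k)]
    Bx = [sum(b[i] * xn for b, xn in zip(B, x)) for i in range(k)]
    return solve(BBt, Bx)

def qem(x, v, T=1):
    """Run T rounds of the two alternating updates."""
    for _ in range(T):
        B = update_codes(x, v)
        v = update_basis(x, B)
    return v, B

def error(x, v, B):
    return sum((level(v, b) - xn) ** 2 for b, xn in zip(B, x))

x = [-1.6, -1.1, -0.4, -0.2, 0.1, 0.3, 0.9, 1.8]   # made-up data
v0 = [1.0, 0.5]                                    # made-up initial basis
v1, B = qem(x, v0, T=1)
print(v1)
print(error(x, v0, B), error(x, v1, B))
```

For a fixed encoding B the basis update is an exact least-squares fit, so the error under the new basis cannot exceed the error under the old one with the same codes.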
Network Training: We use the standard mini-batch based approach to train the LQ-Nets, and our quantizer learning is conducted in the forward passes with the QEM algorithm. Since, for activation quantization, only part of the input data is visible in one iteration due to mini-batch sampling, we apply moving average for the optimized quantizer parameters (i.e., basis vectors). We also apply the moving average strategy for the weight quantizers to gain more stability. The operations in our quantizers are summarized in Algorithm 1.
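A minimal sketch of the moving-average step is given below. The exponential form of the update is an assumption (the text only states that a moving average with a fixed factor is used), and the function name and values are hypothetical.

```python
def moving_average(v_avg, v_new, factor=0.9):
    """Blend the stored basis with the basis fitted on the current mini-batch.

    Assumed form: v_avg <- factor * v_avg + (1 - factor) * v_new.
    """
    return [factor * a + (1.0 - factor) * b for a, b in zip(v_avg, v_new)]

v_avg = [1.0, 0.5]      # basis carried over from earlier iterations
v_batch = [2.0, 1.5]    # basis fitted on the current mini-batch
v_avg = moving_average(v_avg, v_batch)
print(v_avg)            # approximately [1.1, 0.6]
```

A factor close to 1 makes the stored basis change slowly, which damps the noise that comes from seeing only one mini-batch of activations at a time.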
In a backward pass, direct error back-propagation would be problematic as the gradient of the quantization function is 0 almost everywhere. To tackle this issue, we use the Straight-Through Estimator (STE) proposed in [2] to compute the gradients. Specifically, for activations we set the gradient of the quantization function to 1 for values between \(q_1\) and \(q_{2^K}\) defined in Eq. (5) and 0 elsewhere; for weights, the gradient is set to 1 everywhere [3]. The QEM algorithm is unrelated to the backward pass so the quantizers will remain unchanged (unless BP is used to train them instead).
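The two gradient rules can be written out explicitly. The sketch below is framework-free Python with hypothetical function names; treating the range between the smallest and largest level as inclusive at both ends is an assumption, since the text does not specify the boundary behavior.

```python
def ste_grad_activation(x, grad_out, q_min, q_max):
    """Pass the upstream gradient where q_min <= x <= q_max, zero elsewhere."""
    return [g if q_min <= xi <= q_max else 0.0 for xi, g in zip(x, grad_out)]

def ste_grad_weight(grad_out):
    """For weights the quantizer is treated as the identity in the backward pass."""
    return list(grad_out)

# Example with levels spanning [-1.5, 1.5] (made-up values).
x = [-2.0, -1.0, 0.2, 1.7]
grad_out = [0.3, -0.1, 0.5, 0.4]
grad_act = ste_grad_activation(x, grad_out, q_min=-1.5, q_max=1.5)
grad_w = ste_grad_weight(grad_out)
print(grad_act)  # [0.0, -0.1, 0.5, 0.0]
print(grad_w)    # [0.3, -0.1, 0.5, 0.4]
```

Inputs outside the representable range receive no gradient, since the quantizer output there is clipped to the extreme level.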
4 Experiments
In this section, we evaluate the proposed method on two image classification datasets: CIFAR-10 [25] and ImageNet (ILSVRC12) [40]. The CIFAR-10 dataset consists of 60,000 color images of size \(32\times 32\) belonging to 10 classes (6,000 images per class). There are 50,000 training and 10,000 test images. ImageNet ILSVRC12 contains about 1.2 million training and 50K validation images of 1,000 object categories.
Although our method is designed to quantize both weights and activations to facilitate fast inference through bitwise operations, we also conduct experiments of weight-only quantization and compare with the prior art.
4.1 Implementation Details
Our LQ-Nets are implemented with TensorFlow [1] and trained with the aid of the Tensorpack library [47] (see Note 4). We present our implementation details as follows.
Quantizer Implementation: We apply layer re-ordering to the networks similar to [3, 39]: the typical Conv\(\rightarrow \)BN\(\rightarrow \)ReLU sequence is re-organized as BN\(\rightarrow \) ReLU (\(\rightarrow \)Quant.)\(\rightarrow \)Conv. Following previous methods [3, 21, 39, 53, 54], we quantize all the convolution and fully-connected layers but the first and last layers, for which the speedup gained from bitwise operations is low due to their small channel number or filter size [30, 39].
Network Structures: We conduct experiments on AlexNet [26], ResNet [15], DenseNet [20], two variants of the VGG-Net [41] “VGG-Small” and “VGG-Variant” from [3], and a variant of GoogLeNet [42] also from [3]. VGG-Small is a simplified VGG-Net similar to that of [21, 39] but with only one fully-connected layer. VGG-Variant is a smaller version of the model-A in [14]. The GoogLeNet structure in [3] contains some modifications of the original GoogLeNet (e.g., more filters in some \(1\times 1\) conv layers) and we denote it as “GoogleNet-Variant” in this paper. Detailed structures of these network variants can be found in [3]’s publicly available implementation (see Note 5). For ResNet-50, the parameter-free type-A shortcut [15] is adopted in this paper.
Initialization: In all the experiments, our LQ-Nets are trained from scratch (random initialization) without leveraging any pre-trained model. Our quantizers are initialized with uniform quantization (we also tried random initialization and initializing them via pre-computing the quantization levels using [3], however no noticeable difference on the results was observed).
Hyper-parameters and Other Setup: To train on various network architectures, we mostly follow the hyper-parameter settings (learning rate, batch size, training epoch, weight decay, etc.) of their original papers [15, 16, 20]. For fair comparisons with the method of [3], we use the hyper-parameters described in [3] to train all the networks with 1-bit weights and 2-bit activations. The iteration number T in our QEM algorithm is fixed as 1 (no significant benefit was observed with larger values; see Sect. 4.2). The moving average factor for quantizer learning is fixed as 0.9. Details of all our hyper-parameter settings can be found in the supplementary material as well as our released source code.
In the remaining text, we use “W/A” to denote the number of bits used for weights/activations. A bit-width of 32 indicates using 32-bit floating-point values without quantization (thus “w/32” with \(w<32\) indicates weight-only quantization and “32/32” denotes “full-precision” (FP) models). For the experiments on CIFAR-10, we run our method 5 times and report the mean accuracy.
4.2 Performance Analysis
Effectiveness of the QEM Algorithm: Our quantizer can be trained by either the proposed QEM algorithm or a naive BP procedure. In this experiment, we evaluate the effectiveness of the QEM algorithm and compare it against BP. Table 1 shows the performance of the quantized ResNet-20 models on CIFAR-10 test set, and Fig. 3 presents the corresponding training and testing curves. The quantized network trained using QEM is clearly better than BP for weight-only quantization as well as weight-and-activation quantization. In all the following experiments, we use the QEM algorithm to optimize our quantizers.
Table 2 shows the accuracy of quantized ResNet-20 models with different numbers of QEM solver iterations T. As can be seen, using \(T=2,3\) or 4 did not show significant benefit compared to \(T=1\). Note that each time the solver starts from the result of the last training iteration (see Line 6 in Algorithm 1), which is a good starting point especially when the gradients become small after a few epochs. The good performance with \(T=1\) suggests that the iterations of the alternating optimization can be effectively substituted by the training iterations. In this paper, we use \(T=1\) in all the experiments.
Effectiveness of the Learnable Quantizers: The key idea of our method is to apply flexible quantizers and optimize them jointly with the network. Table 3 compares the results of our method and two previous methods: DoReFa-Net [53] and HWGQ [3], the former of which is based on fixed uniform quantizers and the latter of which pre-computes the quantizer by fitting a half-wave Gaussian distribution. It can be seen that using 1-bit weights and 2-bit activations, the ResNet-18 model with our learnable quantizers outperformed HWGQ under the same setting and also outperformed DoReFa-Net with 4-bit activations on ImageNet. More result comparisons on various network structures can be found in Sect. 4.3.
Figure 4 presents the weight and activation statistics in two layers of a trained ResNet-20 model before (i.e., the floating-point values) and after quantization using our method. The network is quantized with “2/2” bits and the floating-point weights are obtained from the last iteration of training (these values can be discarded after training and only quantized values are used at inference time). The floating-point activations are obtained using all the test images of CIFAR-10. It can be seen that our learned quantizers are not uniform and that they differ across layers. Statistical results with more bits can be found in the supplementary material.
Performance w.r.t. Bit-width: We now study the impact of bit-width on the performance of our LQ-Nets. Table 4 shows the results of three network structures: ResNet-20, VGG-Small and ResNet-18.
On the CIFAR-10 dataset, high accuracy can be achieved by our low-precision networks. The accuracy from “3/32” quantization has roughly reached our full-precision result for both ResNet-20 and VGG-Small. The accuracy decreases gracefully with lower bits for weights, and the absolute drops are low even with 1-bit weights: 2.0% for ResNet-20 and 0.3% for VGG-Small. The accuracy drops are more appreciable when quantizing both weights and activations, though the largest absolute drop is only 3.7% for the “1/2” ResNet-20 model. Very minor accuracy drops (maximum 0.4%) are observed for VGG-Small which has many more parameters than ResNet-20.
On the more challenging ImageNet dataset, the accuracy drops of the ResNet-18 model after quantization are relatively larger, especially with very low precision: the largest absolute drop is 7.7% (70.3%\(\rightarrow \)62.6%) with bit-widths of “1/2”. Nevertheless, our learnable quantizer is particularly beneficial when using 2 or more bits due to its high flexibility. The accuracy of the quantized ResNet-18 quickly increases with 2 or more bits as shown in Table 4. The accuracy gap is almost closed with “4/32” bits (0.3% absolute difference only), and the accuracy drop with the “4/4” case is as low as 1%.
4.3 Comparison with Previous Methods
In this section, we compare the performance of our quantization method with existing methods including TWN [29], TTQ [54], BNN [21], BWN [39], XNOR-Net [39], DoReFa-Net [53], HWGQ [3] and ABC-Net [33], with various network architectures tested on CIFAR-10 and ImageNet classification tasks.
Comparison on CIFAR-10: Table 5 presents the results of the VGG-Small model quantized using different methods. All these methods quantize (or binarize) both weights and activations to achieve extremely low precision. With 1-bit weights and 2-bit activations, the accuracy using our method is significantly better than the state-of-the-art method HWGQ (93.4% vs. 92.5%).
Comparison on ImageNet: The results on ImageNet validation set are presented in Table 6. For weight-only quantization, our LQ-Nets outperformed BWN, TWN, TTQ and DoReFa-Net by large margins.
As for quantizing both weights and activations, our results are significantly better than DoReFa-Net and HWGQ when using very low bit-widths (1 bit for weights and few for activations). Our method is even more advantageous when using larger bit-widths. Table 6 shows that with more bits (2, 3, or 4), the accuracy can be dramatically improved by our method. For example, with “4/4” bits, the top-1 accuracy of ResNet-50 is boosted from 68.7% (with “1/2” bits) to 75.1%. The absolute accuracy increase is as high as 6.4%, and the gap to its FP counterpart is reduced to 1.3%. According to Table 6, the accuracy of our LQ-Nets comprehensively surpassed the other competing methods under the same bit-width settings.
4.4 Training Time
Compared to training floating-point networks, our extra cost lies in quantizer optimization. In the QEM algorithm, the cost of solving B is negligible. For N input scalars, the time complexity of solving \(\mathbf {v}\) of length K is \(O(K^2N)\), which in theory is relatively small compared to the cost of the conv operations (see Note 6). Table 7 shows the total training time comparison based on our current unoptimized implementation. The network is ResNet-18 and no bitwise operation is used in all cases. Our training time increases gracefully with larger bit-widths.
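As a rough worked example of this comparison, take the per-layer complexities stated in Note 6 for the activation quantizer and a conv layer, and ignore constant factors. The values \(S=3\) and \(C_{out}=64\) are assumed here purely for illustration:
$$\begin{aligned} \frac{K_a^{2}N_{a}}{S^{2}C_{out}N_{a}}=\frac{K_a^{2}}{S^{2}C_{out}}, \qquad \frac{2^{2}}{3^{2}\cdot 64}=\frac{1}{144}\approx 0.7\%\ \ (K_a=2), \qquad \frac{4^{2}}{3^{2}\cdot 64}=\frac{1}{36}\approx 2.8\%\ \ (K_a=4). \end{aligned}$$
These ratios only count operations in the asymptotic sense; the measured overhead depends on the implementation and need not match them.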
5 Conclusions
We have presented a novel DNN quantization method that led to state-of-the-art accuracy for various network structures. The key idea is to apply learnable quantizers which can be jointly trained with the network parameters to gain more flexibility. Our quantizers can be applied to both weights and activations, and they are made compatible with bitwise operations, facilitating fast inference. In the future, we plan to deploy our LQ-Nets on some resource-constrained devices such as mobile phones and test their performance.
Notes
1. For brevity, we omit the bias term in Eq. (1).
2.
3. Note that \(\mathbf {e}_{i}\) can be either \(\{0,1\}\) encodings or \(\{-1,1\}\) encodings, both of which can yield quantizers compatible with bitwise operations. In our implementation, we adopt the \(\{-1,1\}\) encoding for weights and \(\{0,1\}\) encoding for activations. For convenience we will use the \(\{-1,1\}\) encoding in the remaining text as the example.
4. Our source code is available at https://github.com/Microsoft/LQ-Nets/.
5. https://github.com/zhaoweicai/hwgq (accessed July 10, 2018).
6. To solve for \(\mathbf {v}\) in Eq. (8), we need \(O(K^2N)\) for matrix multiplication \(BB^\mathrm {T}\), \(O(K^3)\) for matrix inverse, and O(KN) for matrix-vector multiplications. Note \(K\ll N\). Let the input and output activation map sizes be \((H,W,C_{in})\) and \((H,W,C_{out})\). The input activation number is \(N_{a}=C_{in}HW\). The time complexity of an \(S\times S\) conv operation with stride 1 is \(O(S^2C_{out}C_{in}HW)=O(S^2C_{out}N_{a})\), whereas that of the quantizer optimization is \(O(K_a^{2}N_{a})\) for activations and \(O(K_w^{2}N_{w})\) for weights.
References
Abadi, M., et al.: TensorFlow: large-scale machine learning on heterogeneous systems (2015). https://www.tensorflow.org/
Bengio, Y., Léonard, N., Courville, A.: Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv:1308.3432 (2013)
Cai, Z., He, X., Sun, J., Vasconcelos, N.: Deep learning with low precision by half-wave Gaussian quantization. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5918–5926 (2017)
Chen, W., Wilson, J.T., Tyree, S., Weinberger, K.Q., Chen, Y.: Compressing neural networks with the hashing trick. In: International Conference on Machine Learning (ICML), pp. 2285–2294 (2015)
Chollet, F.: Xception: Deep learning with depthwise separable convolutions. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1251–1258 (2017)
Courbariaux, M., Bengio, Y., David, J.P.: BinaryConnect: training deep neural networks with binary weights during propagations. In: Advances in Neural Information Processing Systems (NIPS), pp. 3123–3131 (2015)
Denil, M., Shakibi, B., Dinh, L., Ranzato, M., de Freitas, N.: Predicting parameters in deep learning. In: Advances in Neural Information Processing Systems (NIPS), pp. 2148–2156 (2013)
Denton, E., Zaremba, W., Bruna, J., LeCun, Y., Fergus, R.: Exploiting linear structure within convolutional networks for efficient evaluation. In: Advances in Neural Information Processing Systems (NIPS), pp. 1269–1277 (2014)
Dong, Y., Ni, R., Li, J., Chen, Y., Zhu, J., Su, H.: Learning accurate low-bit deep neural networks with stochastic quantization. In: British Machine Vision Conference (BMVC) (2017)
Gong, Y., Liu, L., Yang, M., Bourdev, L.: Compressing deep convolutional networks using vector quantization. arXiv:1412.6115 (2014)
Gupta, S., Agrawal, A., Gopalakrishnan, K., Narayanan, P.: Deep learning with limited numerical precision. In: International Conference on Machine Learning (ICML), pp. 1737–1746 (2015)
Han, S., Mao, H., Dally, W.J.: Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. In: International Conference on Learning Representations (ICLR) (2016)
Han, S., Pool, J., Tran, J., Dally, W.J.: Learning both weights and connections for efficient neural networks. In: Advances in Neural Information Processing Systems (NIPS), pp. 1135–1143 (2015)
He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In: International Conference on Computer Vision (ICCV), pp. 1026–1034 (2015)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778 (2016)
He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 630–645. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_38
He, Y., Zhang, X., Sun, J.: Channel pruning for accelerating very deep neural networks. In: International Conference on Computer Vision (ICCV), pp. 1389–1397 (2017)
Hou, L., Kwok, J.T.: Loss-aware weight quantization of deep networks. In: International Conference on Learning Representations (ICLR) (2018)
Howard, A.G., et al.: MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861 (2017)
Huang, G., Liu, Z., van der Maaten, L.: Densely connected convolutional networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2261–2269 (2017)
Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., Bengio, Y.: Binarized neural networks. In: Advances in Neural Information Processing Systems (NIPS), pp. 4107–4115 (2016)
Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., Bengio, Y.: Quantized neural networks: training neural networks with low precision weights and activations. arXiv:1609.07061 (2016)
Iandola, F.N., Han, S., Moskewicz, M.W., Ashraf, K., Dally, W.J., Keutzer, K.: SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv:1602.07360 (2016)
Jaderberg, M., Vedaldi, A., Zisserman, A.: Speeding up convolutional neural networks with low rank expansions. In: British Machine Vision Conference (BMVC) (2014)
Krizhevsky, A.: Learning multiple layers of features from tiny images. Technical report (2009)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems (NIPS), pp. 1097–1105 (2012)
Lebedev, V., Ganin, Y., Rakhuba, M., Oseledets, I., Lempitsky, V.: Speeding-up convolutional neural networks using fine-tuned CP-decomposition. In: International Conference on Learning Representations (ICLR) (2015)
Lebedev, V., Lempitsky, V.: Fast convnets using group-wise brain damage. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2554–2564 (2016)
Li, F., Zhang, B., Liu, B.: Ternary weight networks. In: NIPS Workshop on Efficient Methods for Deep Neural Networks (2016)
Li, Z., Ni, B., Zhang, W., Yang, X., Gao, W.: Performance guaranteed network acceleration via high-order residual quantization. In: International Conference on Computer Vision (ICCV), pp. 2584–2592 (2017)
Lin, D., Talathi, S., Annapureddy, S.: Fixed point quantization of deep convolutional networks. In: International Conference on Machine Learning (ICML), pp. 2849–2858 (2016)
Lin, M., Chen, Q., Yan, S.: Network in network. In: International Conference on Learning Representations (ICLR) (2014)
Lin, X., Zhao, C., Pan, W.: Towards accurate binary convolutional neural network. In: Advances in Neural Information Processing Systems (NIPS), pp. 345–353 (2017)
Lin, Z., Courbariaux, M., Memisevic, R., Bengio, Y.: Neural networks with few multiplications. In: International Conference on Learning Representations (ICLR) (2016)
Liu, B., Wang, M., Foroosh, H., Tappen, M., Pensky, M.: Sparse convolutional neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 806–814 (2015)
Luo, J.H., Wu, J., Lin, W.: ThiNet: a filter level pruning method for deep neural network compression. In: International Conference on Computer Vision (ICCV), pp. 5058–5066 (2017)
Miyashita, D., Lee, E.H., Murmann, B.: Convolutional neural networks using logarithmic data representation. arXiv:1603.01025 (2016)
Novikov, A., Podoprikhin, D., Osokin, A., Vetrov, D.: Tensorizing neural networks. In: Advances in Neural Information Processing Systems (NIPS), pp. 442–450 (2015)
Rastegari, M., Ordonez, V., Redmon, J., Farhadi, A.: XNOR-Net: ImageNet classification using binary convolutional neural networks. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 525–542. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_32
Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. (IJCV) 115(3), 211–252 (2015)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: International Conference on Learning Representations (ICLR) (2015)
Szegedy, C., et al.: Going deeper with convolutions. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–9 (2015)
Tang, W., Hua, G., Wang, L.: How to train a compact binary neural network with high accuracy? In: AAAI Conference on Artificial Intelligence (AAAI), pp. 2625–2631 (2017)
Wang, J., Zhang, T., Sebe, N., Shen, H.T., et al.: A survey on learning to hash. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 40(4), 769–790 (2018)
Wen, W., Wu, C., Wang, Y., Chen, Y., Li, H.: Learning structured sparsity in deep neural networks. In: Advances in Neural Information Processing Systems (NIPS), pp. 2074–2082 (2016)
Wu, J., Leng, C., Wang, Y., Hu, Q., Cheng, J.: Quantized convolutional neural networks for mobile devices. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4820–4828 (2016)
Wu, Y., et al.: Tensorpack (2016). https://github.com/tensorpack/
Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1492–1500 (2017)
Yu, X., Liu, T., Wang, X., Tao, D.: On compressing deep models by low rank and sparse decomposition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7370–7379 (2017)
Zhang, X., Zhou, X., Lin, M., Sun, J.: ShuffleNet: an extremely efficient convolutional neural network for mobile devices. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6848–6856 (2017)
Zhang, X., Zou, J., He, K., Sun, J.: Accelerating very deep convolutional networks for classification and detection. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 38(10), 1943–1955 (2016)
Zhou, A., Yao, A., Guo, Y., Xu, L., Chen, Y.: Incremental network quantization: towards lossless CNNs with low-precision weights. In: International Conference on Learning Representations (ICLR) (2017)
Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., Zou, Y.: DoReFa-Net: training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv:1606.06160 (2016)
Zhu, C., Han, S., Mao, H., Dally, W.J.: Trained ternary quantization. In: International Conference on Learning Representations (ICLR) (2017)
Zhuang, B., Shen, C., Tan, M., Liu, L., Reid, I.: Towards effective low-bitwidth convolutional neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7920–7928 (2018)
Acknowledgment
This work is partially supported by the National Natural Science Foundation of China under Grant 61629301.
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Zhang, D., Yang, J., Ye, D., Hua, G. (2018). LQ-Nets: Learned Quantization for Highly Accurate and Compact Deep Neural Networks. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds) Computer Vision – ECCV 2018. ECCV 2018. Lecture Notes in Computer Science, vol 11212. Springer, Cham. https://doi.org/10.1007/978-3-030-01237-3_23
DOI: https://doi.org/10.1007/978-3-030-01237-3_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-01236-6
Online ISBN: 978-3-030-01237-3
eBook Packages: Computer Science (R0)