HQOD: Harmonious Quantization for
Object Detection

Long Huang1   Zhiwei Dong1   Song-Lu Chen1   Ruiyao Zhang1
Shutong Ti1   Feng Chen2   Xu-Cheng Yin1†
1 University of Science and Technology Beijing   2 EEasy Technology Company Ltd.
{long.huang.cn, dongz.cn}@outlook.com   [email protected]   [email protected]  
[email protected]   [email protected]   [email protected]
† Corresponding author.
Abstract

The task inharmony problem commonly occurs in modern object detectors, leading to inconsistent quality between the classification and regression tasks. Predicted boxes with high classification scores but poor localization, or with low classification scores but accurate localization, degrade detector performance after Non-Maximum Suppression. Furthermore, when object detectors are trained with Quantization-Aware Training (QAT), we observe that the task inharmony problem is further exacerbated, which we consider one of the main causes of the performance degradation of quantized detectors. To tackle this issue, we propose the Harmonious Quantization for Object Detection (HQOD) framework, which consists of two components. First, we propose a task-correlated loss to encourage detectors to focus on improving samples with lower task harmony quality during QAT. Second, a harmonious Intersection over Union (IoU) loss is incorporated to balance the optimization of the regression branch across different IoU levels. The proposed HQOD can be easily integrated into different QAT algorithms and detectors. Remarkably, on the MS COCO dataset, our 4-bit ATSS with ResNet-50 backbone achieves a state-of-the-art mAP of 39.6%, even surpassing the full-precision one. Code is available at https://github.com/Menace-Dragon/VP-QOD.

Index Terms:
Object Detection, Task Inharmony, Model Quantization, Quantization-Aware Training

I Introduction

Deep Convolutional Neural Networks (CNNs) have made remarkable strides in various applications of object detection [1, 2, 3, 4]. Nevertheless, these high-performing detectors have significant computational and parameter demands, which makes it challenging to run them efficiently on devices with limited resources (e.g., mobile phones or drones).

In modern object detectors, the Task Inharmony (TI) problem has always been a focal point of attention [5, 6, 7]. Usually, a multi-task pipeline is used to generate both location coordinates and corresponding labels for an object, including a classification branch and a regression branch with two parallel heads. This parallel design can result in an inconsistent distribution of classification scores and regression scores (i.e., Intersection over Union, IoU).

Figure 1: The results of object detection between (a) the baseline LSQ [8] and (b) our proposed framework. The ground truth is represented by the red bounding box, and the classification scores of the predicted boxes are explicitly indicated. Our framework enhances the harmonious relationship between the classification task and regression task, producing more accurate bboxes.

Moreover, to efficiently deploy detectors on resource-constrained devices, the application of model compression techniques is necessary.

As one of the popular model compression techniques, Quantization-Aware Training (QAT) methods [9, 10, 8, 11, 12] are introduced as the widely accepted approach to achieve low-bit quantization while preserving near full-precision performance. By simulating the feedforward quantization operations during time-consuming training or fine-tuning, the network can readily adjust to the quantization noise, leading to more optimal solutions. LSQ [8] treats the quantization parameter, step size $s$, as a learnable parameter, enabling $s$ to adaptively learn optimal casting. TQT [11] adopts the concept of learnable $s$ and applies it to hardware-friendly power-of-2 (PoT) scale quantization, achieving exciting results in PoT scale quantization. AQD [12] introduces the Sync-BN in the shared detection head to alleviate the suboptimal issue of $s$ arising from the multi-level inputs. However, when quantizing low-bit object detectors, the performance still exhibits substantial gaps compared to the full-precision ones.

In this work, we identify that the exacerbation of the TI problem under low-bit constraints is one of the primary reasons leading to the performance gap. Taking LSQ [8] as an example, the quantized RetinaNet [2] with ResNet-18 [13] backbone generates a representative instance of inconsistent bounding boxes, as shown in Fig. 1(a). There are two inharmonious candidates (a yellow bounding box and a blue bounding box) and one ground-truth box colored in red. After the Non-Maximum Suppression (NMS) procedure, the yellow one having a high IoU but a low classification score will be suppressed by the less accurate blue one. That is to say, the suboptimal result is preserved, while the best result is overlooked. In our proposed method, this issue can be effectively alleviated by encouraging detectors to generate more harmonious samples while retaining superior detection boxes after the QAT process, as illustrated in Fig. 1(b).

In addition, we find that TI tends to deteriorate further as the bit width required for QAT gets lower. To illustrate this phenomenon, Fig. 2 visualizes the statistics of predicted true positive (TP) samples after NMS. As the bit width decreases, the number of high-quality predictions (i.e., both classification score and IoU at a high level) drops notably, and the distributions deviate from the ideal elliptical region. This observation indicates that the relationship between the two tasks becomes increasingly inharmonious as the bit width decreases.

Figure 2: The statistical results for true positive (TP) samples after NMS. The red dashed ellipse signifies the ideal numerical distribution, where the classification scores and IoU values are both at high levels. Note that FP32 represents a full-precision setting. INT4/2 represents that models are quantized under the 4/2-bit constraints.

To harmonize the optimization of tasks in QAT, we propose the Task-Correlated (TCorr) loss to dynamically optimize samples with low harmony quality. We increase the weights of samples with a wide gap between task scores while suppressing the weights of samples with a small gap between tasks. Thus, the detector is regularized toward harmonious optimization during the low-bit QAT process. Furthermore, we introduce the Harmonious IoU (HIoU) loss to dynamically coordinate the contributions of samples at each IoU level, aiming to suppress the generation of low IoU samples and facilitate the generation of high IoU samples. By embedding the TCorr loss and the HIoU loss into the QAT phase, our QAT framework, namely Harmonious Quantization for Object Detection (HQOD), is constructed, facilitating the generation of a more harmonious and high-quality output distribution under low-bit constraints, as illustrated in Fig. 2.

Extensive experiments on the PASCAL VOC [14] dataset and MS COCO [15] dataset demonstrate the robustness and generality of our method for improving the performance of quantized low-bit object detectors. On the PASCAL VOC benchmark, we achieve an average improvement of 0.75% mAP for LSQ and 0.65% mAP for AQD. On the MS COCO benchmark, we achieve numerous state-of-the-art results. Specifically, our 4-bit ATSS [3] with ResNet-50 [13] achieves 39.6% mAP, outperforming LSQ by 0.6% mAP and surpassing the full-precision one by 0.2% mAP. Even under extreme constraints, our 2-bit ATSS brings an improvement of 1.3% mAP compared to TQT.

II Proposed Method

II-A Paradigm of Quantization-Aware Training

To clarify the operational logic underlying these baselines, we first present the general paradigm of QAT. Quantizing a neural network model can be defined as a finite affine process. Given a full-precision vector $\mathbf{V}=[v_{0},\ldots,v_{n-1}]$ and a bit width $b$, the quantization function $Q_{b}(\cdot)$ can be formulated as follows:

$$\hat{\mathbf{V}} = Q_{b}(\mathbf{V}) \in \{q_{0}, q_{1}, \ldots, q_{2^{b}-1}\} \tag{1}$$

where $q_{i}\in \mathbb{R}$ is a quantization level, and there are $2^{b}$ quantization levels, indexed from $0$ to $2^{b}-1$. In this paper, a per-tensor, symmetric uniform quantization scheme is adopted for quantizing both the weight tensor and feature map (i.e., activation) tensor. Generally, QAT requires training the network with simulated quantization. In the forward propagation, the simulated quantization function can be formulated as:

$$\bar{v} = \mathrm{clip}\left(\left\lfloor \frac{v}{s} \right\rceil, N_{min}, N_{max}\right), \quad \hat{v} = s \cdot \bar{v} \tag{2}$$

where $s$ is called the quantization step size, and $\lfloor\cdot\rceil$ is the rounding function that maps floating-point values to the nearest integers. For signed data, $N_{min}=-2^{b-1}$ and $N_{max}=2^{b-1}-1$. For unsigned data, $N_{min}=0$ and $N_{max}=2^{b}-1$. Specifically, Eq. (2) first quantizes values into the integer domain and then de-quantizes them to undo the scaling step. The effect of quantization is thus simulated while retaining the original scale of the input tensor. In the backward pass, STE [16] is generally introduced to approximate the gradient of the rounding function as 1. Therefore, the local gradient of $\hat{v}$ with respect to $v$ can be defined as:

$$\frac{\partial \hat{v}}{\partial v} = 1, \quad N_{min} \leq \frac{v}{s} \leq N_{max} \tag{3}$$

During the QAT process, the parameters in the model can accept gradient backpropagation and updates. Thus, our subsequent proposals of task-related losses can effectively promote the updates and optimization of model parameters.
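The simulated quantization of Eq. (2) and the straight-through gradient of Eq. (3) can be sketched in a few lines of plain Python. This is a minimal scalar illustration, not the authors' implementation: the function names are hypothetical, the gradient is taken to be 0 outside the clipping range (the usual STE convention, which Eq. (3) does not state explicitly), and Python's built-in `round` (round-half-to-even) stands in for the rounding function.

```python
from typing import Tuple


def quant_range(bits: int, signed: bool = True) -> Tuple[int, int]:
    """Integer clipping bounds (N_min, N_max) for a b-bit quantizer."""
    if signed:
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1


def fake_quantize(v: float, s: float, bits: int, signed: bool = True) -> float:
    """Eq. (2): round v / s, clip to [N_min, N_max], then rescale by s."""
    n_min, n_max = quant_range(bits, signed)
    v_bar = min(max(round(v / s), n_min), n_max)  # integer-domain value
    return s * v_bar                              # de-quantized value


def ste_grad(v: float, s: float, bits: int, signed: bool = True) -> float:
    """Eq. (3): gradient of v_hat w.r.t. v under the straight-through estimator.

    Assumption: the gradient is 0 when v / s falls outside the clipping range.
    """
    n_min, n_max = quant_range(bits, signed)
    return 1.0 if n_min <= v / s <= n_max else 0.0


if __name__ == "__main__":
    for v in (0.26, -0.31, 5.0):
        print(v, fake_quantize(v, s=0.1, bits=4), ste_grad(v, s=0.1, bits=4))
```

With a signed 4-bit quantizer and $s=0.1$, the value 5.0 saturates at $N_{max}=7$, so it is clipped and, under the stated assumption, receives no gradient.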

II-B Harmonious Quantization for Object Detection

II-B1 Task Inharmony in Object Detection


The standard loss of an object detector can be written as follows:

$$\mathcal{L}_{\mathrm{OD}} = \frac{1}{P}\left(\sum_{i\in\mathrm{Pos}}^{P}\left(\mathcal{L}_{\mathrm{cls}}^{i} + \mathcal{L}_{\mathrm{reg}}^{i}\right) + \sum_{j\in\mathrm{Neg}}^{N}\mathcal{L}_{\mathrm{cls}}^{j}\right) \tag{4}$$

where $P$ and $N$ are the number of positive and negative samples, respectively. $\mathcal{L}_{\mathrm{cls}}$ and $\mathcal{L}_{\mathrm{reg}}$ denote the losses for classification and bounding box regression, respectively, whose exact form depends on the model.

The classification and regression branches are trained with separate objective functions for each positive sample, resulting in significant inconsistency between the classification and regression tasks. In other words, $\mathcal{L}_{\mathrm{cls}}$ encourages the model to learn high classification scores during training, irrespective of the localization scores, while $\mathcal{L}_{\mathrm{reg}}$ aims to improve localization ability, regardless of classification results. Consequently, the predicted classification scores and localization scores become detached from each other, leading to inconsistent detections. These inconsistent detections may exhibit high classification scores but low IoUs, or low classification scores but high IoUs, ultimately compromising the overall detection performance after NMS.
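The effect of such inconsistent detections on NMS can be illustrated with a toy example. The sketch below uses a plain greedy NMS and made-up boxes and scores (all values are hypothetical and chosen only for illustration): the better-localized box carries the lower classification score and is therefore suppressed.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float], thr: float = 0.5) -> List[int]:
    """Greedy NMS: keep boxes in descending score order, drop heavy overlaps."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= thr for k in keep):
            keep.append(i)
    return keep


# Hypothetical ground truth and two inharmonious candidates.
gt: Box = (0.0, 0.0, 10.0, 10.0)
well_localized: Box = (0.0, 0.0, 10.0, 9.0)      # accurate box, low score
poorly_localized: Box = (2.0, 0.0, 12.0, 10.0)   # shifted box, high score

kept = nms([well_localized, poorly_localized], scores=[0.4, 0.9])

if __name__ == "__main__":
    print("IoU with GT:", iou(well_localized, gt), iou(poorly_localized, gt))
    print("indices kept after NMS:", kept)
```

Only the shifted box survives, although the suppressed box overlaps the ground truth more closely, which is the failure mode described above.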

II-B2 Task-Correlated Loss

Figure 3: Visualization of the task correlation indicator $c$ with different settings of $\beta_{\mathrm{cls}}$ and $\beta_{\mathrm{reg}}$ in Eq. (5). (a) $\beta_{\mathrm{cls}}=0.5$, $\beta_{\mathrm{reg}}=0.5$. (b) $\beta_{\mathrm{cls}}=u$, $\beta_{\mathrm{reg}}=p$.

During the QAT procedure, the standard detection loss in Eq. (4) is used to update and optimize the weights and quantization step sizes. However, noise introduced by quantization obstructs the optimization process, further exacerbating the task inharmony problem, especially in low-bit quantization. To tackle this problem, we first introduce a task correlation indicator that characterizes the harmony quality of a sample, defined as follows:

$$c_{i} = p_{i}^{\beta_{\mathrm{cls}}} \cdot u_{i}^{\beta_{\mathrm{reg}}} \tag{5}$$

where $p_{i}$ denotes the classification score of the $i$-th positive sample, output by the classification branch, and $u_{i}$ is the IoU score of the $i$-th positive sample predicted from the regression branch. Note that the range of both the classification score and the IoU is $[0,1]$; therefore, the range of the task correlation indicator is also $[0,1]$. This indicator has two basic characteristics. Firstly, it takes into consideration the score quality of the two branches. When $c=1$, both $p$ and $u$ are equal to 1, indicating that both the classification and regression branches produce high-quality samples. When $c=0$, at least one of the branch scores is 0, implying that the model outputs low-quality samples. Secondly, it also characterizes the consistency between the two branches. A smaller gap in scores between the two branches produces a higher $c$, while a larger difference in the output results of the two branches produces a lower $c$. $\beta_{\mathrm{cls}}$ and $\beta_{\mathrm{reg}}$ are two dynamic factors rather than two constants, and we define them as:

$$\beta_{\mathrm{cls}} = u_{i}, \quad \beta_{\mathrm{reg}} = p_{i} \tag{6}$$

By substituting Eq. (6) into Eq. (5), we can obtain the functional effect as illustrated in Fig. 3. When both $p$ and $u$ are low, $c$ does not exhibit a low value under the influence of dynamic factors; instead, $c$ tends to increase. We posit that when both task scores of a sample are low, there is less need to excessively discuss the harmony issue of that sample, while $c$ will attain a relatively higher level. In this manner, only two types of samples will be given special attention and assigned lower $c$, preventing a large number of low-quality positive samples from dominating the harmonious optimization process. One type consists of samples with a high score on one of the tasks but with a significant score gap between tasks. The other type includes samples that exhibit a small score gap between tasks but with both task scores hovering around 0.5. Subsequently, we propose the Task-Correlated (TCorr) loss to focus on and optimize the samples with lower $c$, as shown in Eq. (7).

$$\mathcal{L}_{\mathrm{TCorr}}^{i} = \alpha\left(e^{-c_{i}} - e^{-1}\right), \quad \alpha = \left(1 + |p_{i} - u_{i}|\right) \tag{7}$$

where $\alpha$ serves as the reweighting parameter and does not accept gradient backpropagation. The larger the score gap between tasks, the greater the weight placed on the loss of that sample.
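To make the behavior of the indicator and the loss concrete, Eqs. (5) to (7) can be evaluated on a few representative score pairs. The following is a minimal scalar sketch with hypothetical function names and hand-picked sample values; in an autograd framework, $\alpha$ would additionally be detached from the graph, which plain Python does not model.

```python
import math


def task_correlation(p: float, u: float) -> float:
    """Eq. (5) with the dynamic exponents of Eq. (6): c = p**u * u**p."""
    return (p ** u) * (u ** p)


def tcorr_loss(p: float, u: float) -> float:
    """Eq. (7): alpha * (exp(-c) - exp(-1)) with alpha = 1 + |p - u|."""
    c = task_correlation(p, u)
    alpha = 1.0 + abs(p - u)  # reweighting factor, treated as a constant
    return alpha * (math.exp(-c) - math.exp(-1.0))


# Illustrative (classification score, IoU) pairs; the values are made up.
samples = {
    "both high": (0.9, 0.9),
    "both low": (0.1, 0.1),
    "both around 0.5": (0.5, 0.5),
    "large gap": (0.9, 0.1),
}

if __name__ == "__main__":
    for name, (p, u) in samples.items():
        print(f"{name:16s} c={task_correlation(p, u):.3f} "
              f"loss={tcorr_loss(p, u):.3f}")
```

On these inputs the large-gap sample receives the lowest $c$ and the highest loss, the sample with both scores near 0.5 comes next, and the sample with both scores low is assigned a comparatively high $c$, in line with the discussion of Fig. 3.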

II-B3 Harmonious IoU Loss


After using TCorr, there is a reduction in the proportion of positive samples at lower IoU levels, aligning with our expected outcome as illustrated in Fig. 4. However, we observe a slight decrease in the proportion of positive samples at higher IoU levels, particularly within the IoU range of 0.9 to 1.0, contrary to our expectations. We consider that this phenomenon arises due to the dominance of samples with lower IoU levels in the optimization process of the regression branch, thereby dominating the directions of gradient updates. Generally, detectors tend to generate a substantial number of positive bounding boxes for the optimization of localization losses. However, a significant proportion of these samples exhibit low IoU levels, consequently exerting a predominant influence on the optimization direction of the localization loss. To mitigate this biased optimization, Harmonious IoU (HIoU) loss is introduced and defined as follows:

$$\mathcal{L}_{\mathrm{HIoU}}^{i} = (1+u_{i})^{\gamma}(1-u_{i}) \tag{8}$$

In our experiments, the hyperparameter $\gamma$ is set to 0.8. The adaptive weight $(1+u_{i})^{\gamma}$ is utilized to harmonize the localization loss across various IoU levels. Thus, HIoU loss enhances the weighting of the localization loss for samples with high IoU and simultaneously suppresses the weighting of samples with low IoU through a dynamic scaling factor. By additionally utilizing HIoU loss, the model exhibits a reduction in the output of samples with low IoU quality while generating more samples with higher IoU quality, as depicted in Fig. 4.
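The scaling behavior of Eq. (8) can be checked numerically. The sketch below, with hypothetical function names, separates the adaptive weight $(1+u_i)^{\gamma}$ from the plain IoU loss term $(1-u_i)$ and evaluates both at a few IoU levels with $\gamma=0.8$.

```python
def hiou_weight(u: float, gamma: float = 0.8) -> float:
    """Adaptive weight (1 + u)**gamma applied to the plain IoU loss."""
    return (1.0 + u) ** gamma


def hiou_loss(u: float, gamma: float = 0.8) -> float:
    """Eq. (8): (1 + u)**gamma * (1 - u)."""
    return hiou_weight(u, gamma) * (1.0 - u)


if __name__ == "__main__":
    for u in (0.1, 0.3, 0.5, 0.7, 0.9):
        print(f"IoU={u:.1f} weight={hiou_weight(u):.3f} "
              f"plain={1.0 - u:.3f} HIoU={hiou_loss(u):.3f}")
```

The weight grows monotonically from 1 at $u=0$ to $2^{\gamma}$ at $u=1$, so samples with higher IoU contribute relatively more than they would under the unweighted term $1-u$.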

II-B4 Overall Optimization Objective


By incorporating the TCorr loss and the HIoU loss with Eq. (4), the overall optimization objective of the proposed Harmonious Quantization for Object Detection framework can be written as follows:

$$\mathcal{L}_{\mathrm{HQOD}} = \mathcal{L}_{\mathrm{OD}} + \frac{1}{P}\sum_{i\in\mathrm{Pos}}^{P}\left(\mathcal{L}_{\mathrm{TCorr}}^{i} + \sigma\mathcal{L}_{\mathrm{HIoU}}^{i}\right) \tag{9}$$

The trade-off parameter $\sigma$ is set to 1.5 in our experiments. Detectors are trained by optimizing all these losses in an end-to-end manner during the QAT procedure.
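As a rough illustration of how the terms in Eq. (9) combine, the sketch below adds the averaged TCorr and HIoU terms over positive samples to a given detection loss value. It is a scalar sketch under stated assumptions: the function names are hypothetical, $\mathcal{L}_{\mathrm{OD}}$ is passed in as a precomputed number, and an empty positive set is assumed to contribute nothing.

```python
import math
from typing import List, Tuple


def tcorr_loss(p: float, u: float) -> float:
    """Eq. (7) with c = p**u * u**p from Eqs. (5) and (6)."""
    c = (p ** u) * (u ** p)
    return (1.0 + abs(p - u)) * (math.exp(-c) - math.exp(-1.0))


def hiou_loss(u: float, gamma: float = 0.8) -> float:
    """Eq. (8)."""
    return (1.0 + u) ** gamma * (1.0 - u)


def hqod_loss(l_od: float, positives: List[Tuple[float, float]],
              sigma: float = 1.5, gamma: float = 0.8) -> float:
    """Eq. (9): L_OD plus the mean of (TCorr + sigma * HIoU) over positives.

    `positives` holds (classification score, IoU) pairs, one per positive sample.
    """
    if not positives:  # assumption: no positives means no extra term
        return l_od
    extra = sum(tcorr_loss(p, u) + sigma * hiou_loss(u, gamma)
                for p, u in positives)
    return l_od + extra / len(positives)


if __name__ == "__main__":
    print(hqod_loss(1.0, [(0.9, 0.9), (0.9, 0.1), (0.5, 0.5)]))
```

A perfectly harmonious sample with $p=u=1$ adds nothing to the objective, while a sample with a large score gap adds both a TCorr penalty and a weighted IoU penalty.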

III Experiments

III-A Experimental Settings

Datasets. We conduct extensive experiments on two widely used object detection datasets, MS COCO [15] and PASCAL VOC [14]. For MS COCO, models are trained on the 118k training split and evaluated on the minival split. For PASCAL VOC, the union of VOC2007 trainval and VOC2012 trainval is utilized for training, while the evaluation is performed on the VOC2007 test split. Both datasets are evaluated with the standard MS COCO metrics, namely mean average precision (mAP), $\mathrm{AP}_{50}$, and $\mathrm{AP}_{75}$, which provide more comprehensive performance details.

Object Detection Baselines. Four popular object detectors are used in our experiments, including RetinaNet [2], ATSS [3], and two lightweight models, SSD-lite [1] and YOLOX-tiny [4]. ResNet-18 [13], ResNet-50 [13], MobileNetV2 [17], and CSPDarknet [18], pre-trained on the standard classification task [19], are used as our backbone networks. All full-precision models on MS COCO and training process are obtained from open-source code, MMDetection [20].

Figure 4: The average improvement compared to baseline LSQ [8] for positive samples from different IoU intervals.

Quantization Baselines. LSQ [8], TQT [11], and AQD [12] are adopted as QAT baseline methods. LSQ and TQT are implemented with MQBench [21]. For a fair comparison, we implement AQD [12] based on MQBench, eliminating the influence of different quantization settings. Note that we only implement the essence of AQD, which introduces Sync-BN [12] in the head of the detector.

QAT Settings. We adopt per-tensor uniform symmetric quantization as mentioned in Section II-A and quantize all the convolutional layers of a model. When a convolutional layer is quantized, it means that both the input feature maps and the layer weights are quantized under the same bit constraint. The first and last layers of all the detectors are only quantized to 8-bit widths. For the workflow and hyper-parameters of QAT, please refer to Section A of the supplementary materials.

TABLE I: Performance for RetinaNet with backbone ResNet-18 on PASCAL VOC. * represents our re-implementation. BW represents the global bit width.

| Method | BW | mAP | $\mathrm{AP}_{50}$ | $\mathrm{AP}_{75}$ |
|---|---|---|---|---|
| Full-precision | FP32 | 50.4 | 78.9 | 53.8 |
| LSQ [8] | INT4 | 50.2 | 78.6 | 53.4 |
| LSQ+HQOD | INT4 | 50.8 | 78.7 | 53.9 |
| LSQ [8] | INT2 | 47.9 | 75.3 | 50.6 |
| LSQ+HQOD | INT2 | 48.8 | 75.3 | 51.5 |
| Sync-BN [12] | FP32 | 51.3 | 80.1 | 55.1 |
| AQD* [12] | INT4 | 50.7 | 79.5 | 54.3 |
| AQD*+HQOD | INT4 | 51.3 | 79.5 | 55.6 |
| AQD* [12] | INT2 | 48.1 | 75.6 | 51.0 |
| AQD*+HQOD | INT2 | 48.8 | 75.2 | 52.8 |

III-B Comparisons to SOTA Quantization Methods

The quantization performance of 4-bit and 2-bit detectors is shown in Tables I, II, and III. Additionally, we include performance results of full-precision detectors for comparison.

III-B1 Experiments on PASCAL VOC


From Table I, we can observe that our HQOD consistently improves mAP across different bit widths. Specifically, under the 4-bit configuration, our HQOD outperforms both LSQ and AQD by 0.6% mAP. Even under the extreme 2-bit constraint, improvements of 0.9% and 0.7% mAP are obtained over the baselines LSQ and AQD, respectively.
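The per-setting gains and the averages quoted for PASCAL VOC follow directly from the mAP column of Table I. The short script below reproduces that arithmetic; the dictionary layout and variable names are illustrative only.

```python
# mAP values copied from Table I: (baseline, baseline + HQOD).
table_i = {
    ("LSQ", 4): (50.2, 50.8),
    ("LSQ", 2): (47.9, 48.8),
    ("AQD", 4): (50.7, 51.3),
    ("AQD", 2): (48.1, 48.8),
}

# Absolute mAP gain of HQOD over each baseline, rounded to one decimal.
gains = {key: round(hqod - base, 1) for key, (base, hqod) in table_i.items()}

# Average gain per QAT method across the 4-bit and 2-bit settings.
avg_gain = {
    method: round(
        sum(g for (m, _), g in gains.items() if m == method) / 2, 2
    )
    for method in ("LSQ", "AQD")
}

if __name__ == "__main__":
    print(gains)
    print(avg_gain)
```

The resulting averages correspond to the 0.75% mAP (LSQ) and 0.65% mAP (AQD) improvements stated in the introduction.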

TABLE II: Performance for 4-bit detectors on MS COCO.

| Model | Method | BW | mAP | $\mathrm{AP}_{50}$ | $\mathrm{AP}_{75}$ |
|---|---|---|---|---|---|
| RetinaNet (ResNet-18) | Full-precision | FP32 | 31.7 | 49.6 | 33.4 |
| RetinaNet (ResNet-18) | LSQ [8] | INT4 | 31.4 | 49.3 | 33.0 |
| RetinaNet (ResNet-18) | LSQ+HQOD | INT4 | 32.5 | 50.0 | 34.2 |
| RetinaNet (ResNet-18) | Sync-BN [12] | FP32 | 32.0 | 49.7 | 33.8 |
| RetinaNet (ResNet-18) | AQD* [12] | INT4 | 32.1 | 50.4 | 33.9 |
| RetinaNet (ResNet-18) | AQD*+HQOD | INT4 | 33.1 | 50.9 | 35.0 |
| RetinaNet (ResNet-50) | Full-precision | FP32 | 37.4 | 56.7 | 39.6 |
| RetinaNet (ResNet-50) | LSQ [8] | INT4 | 35.1 | 53.9 | 37.3 |
| RetinaNet (ResNet-50) | LSQ+HQOD | INT4 | 36.9 | 55.4 | 39.5 |
| ATSS (ResNet-50) | Full-precision | FP32 | 39.4 | 57.6 | 42.8 |
| ATSS (ResNet-50) | LSQ [8] | INT4 | 39.0 | 57.8 | 41.9 |
| ATSS (ResNet-50) | LSQ+HQOD | INT4 | 39.6 | 58.1 | 42.5 |
| ATSS (ResNet-50) | TQT [11] | INT4 | 39.0 | 57.9 | 41.9 |
| ATSS (ResNet-50) | TQT+HQOD | INT4 | 39.5 | 58.0 | 42.4 |
| SSD-lite (MobileNetV2) | Full-precision | FP32 | 21.3 | 35.4 | 21.8 |
| SSD-lite (MobileNetV2) | LSQ [8] | INT4 | 18.6 | 31.6 | 19.0 |
| SSD-lite (MobileNetV2) | LSQ+HQOD | INT4 | 18.8 | 31.8 | 19.0 |
| YOLOX-tiny (CSPDarknet) | Full-precision | FP32 | 31.8 | 49.1 | 33.8 |
| YOLOX-tiny (CSPDarknet) | LSQ [8] | INT4 | 24.2 | 40.2 | 25.5 |
| YOLOX-tiny (CSPDarknet) | LSQ+HQOD | INT4 | 24.7 | 41.2 | 25.8 |

III-B2 Experiments on MS COCO


Results for 4-bit detectors are presented in Table II. For RetinaNet with the ResNet-18 backbone, our HQOD loss improves the mAP of LSQ by 1.1%, surpassing the FP32 model by 0.8%. Based on AQD, our method achieves the best result of 33.1% mAP, surpassing the original full-precision version by 1.4%. For RetinaNet with the deeper ResNet-50 backbone, our method also outperforms LSQ by 1.8%. On ATSS with ResNet-50, the HQOD loss improves the mAP of LSQ and TQT by 0.6% and 0.5%, respectively. At the stricter IoU threshold of 0.75 ($\mathrm{AP}_{75}$), the improvement is also notable, with gains of 0.6% and 0.5%, respectively. These results indicate that QAT methods trained with our HQOD loss produce more accurate bboxes. For the lightweight detectors with constrained learning capabilities, our method yields mAP improvements of 0.2% and 0.5% over LSQ for SSD-lite and YOLOX-tiny, respectively.

TABLE III: Performance for 2-bit detectors on MS COCO.

| Model | Method | BW | mAP | $\mathrm{AP}_{50}$ | $\mathrm{AP}_{75}$ |
|---|---|---|---|---|---|
| RetinaNet (ResNet-18) | Full-precision | FP32 | 31.7 | 49.6 | 33.4 |
| RetinaNet (ResNet-18) | LSQ [8] | INT2 | 29.3 | 46.7 | 30.3 |
| RetinaNet (ResNet-18) | LSQ+HarDet* [22] | INT2 | 28.8 | 43.1 | 30.8 |
| RetinaNet (ResNet-18) | LSQ+HQOD | INT2 | 30.7 | 47.4 | 32.3 |
| RetinaNet (ResNet-18) | Sync-BN [12] | FP32 | 32.0 | 49.7 | 33.8 |
| RetinaNet (ResNet-18) | AQD* [12] | INT2 | 28.9 | 46.2 | 30.4 |
| RetinaNet (ResNet-18) | AQD*+HQOD | INT2 | 30.1 | 46.7 | 31.8 |
| ATSS (ResNet-50) | Full-precision | FP32 | 39.4 | 57.6 | 42.8 |
| ATSS (ResNet-50) | TQT [11] | INT2 | 33.5 | 51.1 | 35.4 |
| ATSS (ResNet-50) | TQT+HQOD | INT2 | 34.8 | 51.9 | 37.1 |

We extend the experiments to the more challenging 2-bit constraint. As shown in Table III, both LSQ and AQD are evaluated on the RetinaNet framework, and the proposed HQOD loss achieves additional gains of 1.4% and 1.2% mAP, respectively. We also re-implement the HarDet [22] method for comparison, which aims to promote harmony between tasks for full-precision detectors. However, HarDet performs poorly under the 2-bit constraint, with a 0.5% degradation in mAP relative to LSQ. For TQT on ATSS, our HQOD loss further improves mAP by 1.3%, with 0.8% and 1.7% recovery in $\mathrm{AP}_{50}$ and $\mathrm{AP}_{75}$, respectively. In summary, our proposed method has the potential to further boost the performance of state-of-the-art QAT methods.

TABLE IV: Ablation Study on RetinaNet with backbone ResNet-18 under 2-bit constraint.

| Method | mAP | $\mathrm{AP}_{50}$ | $\mathrm{AP}_{75}$ |
|---|---|---|---|
| LSQ | 29.3 | 46.7 | 30.3 |
| +TCorr | 29.9 | 47.6 | 31.6 |
| +TCorr+HIoU | 30.7 | 47.4 | 32.3 |

III-C Ablation Study

We use LSQ as the baseline and employ a 2-bit RetinaNet detector with ResNet-18 to conduct ablation experiments on the MS COCO dataset. As shown in Table IV, our proposed TCorr loss improves mAP by 0.6%. After additionally introducing the Harmonious IoU (HIoU) loss, a further improvement of 0.8% is achieved, resulting in a final mAP of 30.7%.

Figure 5: The distribution of the gap value between IoU and classification score (Cls) based on RetinaNet with ResNet-18.

III-D Quantitative Analysis

We conduct quantitative analysis on the MS COCO dataset. Using the absolute gap between the classification score and IoU as the evaluation metric, the statistics of TP samples are shown in Fig. 5. After LSQ, the proportion of samples with score gaps from 0 to 0.1 significantly decreases, while the proportion of gaps ranging from 0.2 to 0.4 increases. The degradation becomes more pronounced as the bit width decreases. These observations indicate a further exacerbation of task inharmony after low-bit QAT. After introducing our HQOD method, the proportion of gaps from 0 to 0.1 significantly increases, and inharmonious samples are effectively suppressed. We also provide qualitative comparisons between state-of-the-art QAT algorithms and our method under 2-bit constraints, as illustrated in Fig. 6. Promoting the optimization of detectors toward task harmony during the QAT phase is imperative. Our HQOD framework enables low-bit detectors to generate more accurate detection bounding boxes, effectively alleviating the inharmony between classification scores and localization scores. For more analysis, please refer to Section B of the supplementary materials.
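The gap statistic used in this analysis can be computed with a simple histogram. The sketch below bins the absolute gap between classification score and IoU into intervals of width 0.1 and returns the proportion of samples per bin; the function name and the sample values are hypothetical and do not correspond to the data behind Fig. 5.

```python
from typing import List, Tuple


def gap_histogram(samples: List[Tuple[float, float]],
                  n_bins: int = 10) -> List[float]:
    """Proportion of samples per |cls - IoU| bin over [0, 1]."""
    counts = [0] * n_bins
    for p, u in samples:
        gap = abs(p - u)
        idx = min(int(gap * n_bins), n_bins - 1)  # gap == 1.0 goes to last bin
        counts[idx] += 1
    total = len(samples)
    return [c / total for c in counts] if total else [0.0] * n_bins


# Made-up (classification score, IoU) pairs for true positive samples.
tp_samples = [(0.90, 0.85), (0.70, 0.68), (0.80, 0.55), (0.95, 0.60)]
hist = gap_histogram(tp_samples)

if __name__ == "__main__":
    for i, share in enumerate(hist):
        print(f"gap in [{i / 10:.1f}, {(i + 1) / 10:.1f}): {share:.2f}")
```

A more harmonious detector would concentrate mass in the first bin, which is the shift reported above after applying HQOD.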

[Figure 6 panels: (a) LSQ, (b) AQD, (c) TQT. Each panel shows detection results without (w/o) and with (w/) HQOD.]
Figure 6: Qualitative comparisons between the state-of-the-art QAT methods with or without our HQOD framework. Our HQOD framework concurrently enhances the accuracy of both classification and regression results, producing more harmonious samples.

IV Conclusion

In this work, we identify that the task inharmony problem is exacerbated when object detectors undergo quantization-aware training (QAT) and becomes more pronounced as the bit width decreases, which is one of the primary causes of the performance degradation of quantized detectors. To foster a more harmonious QAT process, we propose the Harmonious Quantization for Object Detection (HQOD) framework, which consists of two losses: the Task-Correlated (TCorr) loss and the Harmonious IoU (HIoU) loss. The TCorr loss makes detectors focus more on optimizing samples with lower task harmony quality, and the HIoU loss balances the optimization of the regression branch across different IoU levels during QAT. The combination of the proposed losses can be conveniently integrated into various state-of-the-art QAT methods, effectively enhancing the task harmony of low-bit detectors and improving their overall performance. Future work could extend this approach to more complex task domains such as semantic segmentation and 3D object detection. We expect that a similar exacerbation of task inharmony after low-bit quantization also exists in these domains.

Acknowledgment

We thank the anonymous reviewers for their kind advice on this research. This work is supported by the National Key Research and Development Program of China (2020AAA0109700), the National Science Fund for Distinguished Young Scholars (62125601), and the National Natural Science Foundation of China (62076024, 62006018).

References

  • [1] W. Liu, D. Anguelov, D. Erhan et al., “SSD: Single shot multibox detector,” in ECCV, 2016.
  • [2] T.-Y. Lin, P. Goyal, R. Girshick et al., “Focal loss for dense object detection,” in ICCV, 2017.
  • [3] S. Zhang, C. Chi, Y. Yao et al., “Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection,” in CVPR, 2020.
  • [4] Z. Ge, S. Liu, F. Wang et al., “YOLOX: Exceeding YOLO series in 2021,” arXiv:2107.08430, 2021.
  • [5] C. Feng, Y. Zhong, Y. Gao et al., “TOOD: Task-aligned one-stage object detection,” in ICCV, 2021.
  • [6] R. Tang, Z. yu Liu, Y. Li et al., “Task-balanced distillation for object detection,” Pattern Recognit., 2022.
  • [7] J. Deng, D. Xu, W. Li et al., “Harmonious teacher for cross-domain object detection,” in CVPR, 2023.
  • [8] S. K. Esser, J. L. McKinstry, D. Bablani et al., “Learned step size quantization,” in ICLR, 2020.
  • [9] C. Yang, R. Zhang, L. Huang et al., “A survey of quantization methods for deep neural networks,” Chinese Journal of Engineering, 2023.
  • [10] R. Gong, X. Liu, S. Jiang et al., “Differentiable soft quantization: Bridging full-precision and low-bit neural networks,” in ICCV, 2019.
  • [11] S. Jain, A. Gural, M. Wu et al., “Trained quantization thresholds for accurate and efficient fixed-point inference of deep neural networks,” in MLSys, 2020.
  • [12] P. Chen, J. Liu, B. Zhuang et al., “AQD: Towards accurate quantized object detection,” in CVPR, 2021.
  • [13] K. He, X. Zhang, S. Ren et al., “Deep residual learning for image recognition,” in CVPR, 2016.
  • [14] M. Everingham, S. A. Eslami, L. Van Gool et al., “The PASCAL visual object classes challenge: A retrospective,” IJCV, 2015.
  • [15] T.-Y. Lin, M. Maire, S. Belongie et al., “Microsoft COCO: Common objects in context,” in ECCV, 2014.
  • [16] Y. Bengio, N. Léonard, and A. Courville, “Estimating or propagating gradients through stochastic neurons for conditional computation,” arXiv:1308.3432, 2013.
  • [17] A. G. Howard, M. Zhu, B. Chen et al., “MobileNets: Efficient convolutional neural networks for mobile vision applications,” arXiv:1704.04861, 2017.
  • [18] C.-Y. Wang, H.-Y. M. Liao, Y.-H. Wu et al., “CSPNet: A new backbone that can enhance learning capability of CNN,” in CVPR Workshop, 2020.
  • [19] J. Deng, W. Dong, R. Socher et al., “ImageNet: A large-scale hierarchical image database,” in CVPR, 2009.
  • [20] K. Chen, J. Wang, J. Pang et al., “MMDetection: Open MMLab detection toolbox and benchmark,” arXiv:1906.07155, 2019.
  • [21] Y. Li, M. Shen, J. Ma et al., “MQBench: Towards reproducible and deployable model quantization benchmark,” in NeurIPS, 2021.
  • [22] K. Wang and L. Zhang, “Reconcile prediction consistency for balanced object detection,” in ICCV, 2021.