WAPS-Quant: Low-Bit Post-Training Quantization Using Weight-Activation Product Scaling

Bibliographic Details
Main Authors: Geunjae Choi, Kamin Lee, Nojun Kwak
Format: Article
Language: English
Published: IEEE, 2025-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/10982219/
Description
Summary: Post-Training Quantization (PTQ) effectively compresses neural networks to very few bits using only a limited calibration dataset. Various quantization methods utilizing second-order error have been proposed and have demonstrated good performance. At extremely low bit-widths, however, quantization error grows significantly, hindering optimal performance. Previous second-order-error-based PTQ methods relied solely on quantization scale values and weight rounding. We introduce a weight-activation product scaling method that, used alongside weight rounding and scale-value adjustment, effectively reduces quantization error even at very low bit-widths. The proposed method compensates for the errors introduced by quantization, thereby achieving results closer to the original model. It also limits the potential increase in computational and memory complexity through channel-wise grouping, shifting, and channel-mixing techniques. Our method is validated on various CNNs and extended to ViT and object detection models, showing strong generalization across architectures. The proposed approach improves accuracy in 2/4-bit quantization with less than 1.5% computational overhead, and hardware-level simulation on a silicon-proven ASIC NPU confirms higher accuracy with negligible latency overhead, making it practical for real-time edge deployment.
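The abstract describes the method only at a high level, so the following is a minimal sketch of one plausible reading of weight-activation product scaling, not the authors' algorithm: after rounding weights to a low bit-width, fit a per-output-channel multiplier on calibration data so that the scaled quantized product s * (W_q x) best matches the full-precision product W x in the least-squares sense. The symmetric uniform quantizer, the function names (quantize_weights, calibrate_waps), and the 2-bit toy setting are all illustrative assumptions.

```python
# Illustrative sketch (NumPy) of per-channel weight-activation product
# scaling for low-bit PTQ; names and details are assumptions, not the
# paper's implementation.
import numpy as np

def quantize_weights(w: np.ndarray, bits: int = 2) -> np.ndarray:
    """Symmetric uniform quantization of a weight matrix to `bits` bits,
    returned in de-quantized (float) form."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def calibrate_waps(w: np.ndarray, w_q: np.ndarray, x_cal: np.ndarray) -> np.ndarray:
    """Fit one multiplier per output channel minimizing
    ||s * (w_q @ x) - w @ x||^2 over the calibration batch (closed form)."""
    y = w @ x_cal        # full-precision products, shape (out_ch, n_samples)
    y_q = w_q @ x_cal    # quantized products
    return (y * y_q).sum(axis=1) / ((y_q * y_q).sum(axis=1) + 1e-12)

# Toy usage: a 2-bit linear layer with 64 inputs and 32 output channels.
rng = np.random.default_rng(0)
w = rng.normal(size=(32, 64))
x_cal = rng.normal(size=(64, 256))   # hypothetical calibration activations

w_q = quantize_weights(w, bits=2)
s = calibrate_waps(w, w_q, x_cal)

err_before = np.mean((w @ x_cal - w_q @ x_cal) ** 2)
err_after = np.mean((w @ x_cal - s[:, None] * (w_q @ x_cal)) ** 2)
print(f"output MSE before scaling: {err_before:.4f}, after: {err_after:.4f}")
```

Because the fitted factors act as a single elementwise multiplier per output channel at inference time, this style of compensation is cheap, which is consistent with the abstract's claim of under 1.5% computational overhead.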
ISSN:2169-3536