WAPS-Quant: Low-Bit Post-Training Quantization Using Weight-Activation Product Scaling
Saved in:
| Main Authors: | , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Subjects: | |
| Online Access: | https://ieeexplore.ieee.org/document/10982219/ |
| Summary: | Post-Training Quantization (PTQ) compresses neural networks to very few bits using a limited calibration dataset. Various quantization methods utilizing second-order error have been proposed and have demonstrated good performance. However, at extremely low bit-widths the increase in quantization error is significant, hindering optimal performance. Previous second-order-error-based PTQ methods relied solely on quantization scale values and weight rounding. We introduce a weight-activation product scaling method that, used alongside weight rounding and scale-value adjustment, effectively reduces quantization error even at very low bits. The proposed method compensates for the errors resulting from quantization, thereby achieving results closer to the original model. It also limits the potential increase in computational and memory complexity through channel-wise grouping, shifting, and channel-mixing techniques. Our method is validated on various CNNs and extended to ViT and object detection models, showing strong generalization across architectures. The proposed approach improves accuracy in 2/4-bit quantization with less than 1.5% computational overhead, and hardware-level simulation on a silicon-proven ASIC NPU confirms higher accuracy with negligible latency overhead, making it practical for real-time edge deployment. |
| ISSN: | 2169-3536 |
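The summary above describes scaling the weight-activation product, rather than the weights alone, to compensate for quantization error. The details of WAPS-Quant (grouping, shifting, channel mixing) are not given in this record, but the core idea can be illustrated with a minimal, hypothetical sketch: quantize weights uniformly, then fit one compensation scale per output channel that minimizes the error of the quantized pre-activations on calibration data. All function names, shapes, and the closed-form scale here are illustrative assumptions, not the paper's actual algorithm.

```python
import numpy as np

def quantize_uniform(w, n_bits=2):
    """Symmetric per-channel uniform quantization of a weight matrix (illustrative)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)          # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                                   # dequantized weights

def product_scaling(w, w_q, x):
    """Per-output-channel scale s_i minimizing ||s_i * (w_q_i @ x) - (w_i @ x)||^2.
    Least-squares closed form: s_i = <y_i, yq_i> / <yq_i, yq_i>.
    This is a stand-in for the paper's weight-activation product scaling."""
    y = w @ x                                          # original pre-activations
    y_q = w_q @ x                                      # quantized pre-activations
    num = (y * y_q).sum(axis=1)
    den = (y_q * y_q).sum(axis=1)
    s = np.where(den > 0, num / np.maximum(den, 1e-12), 1.0)
    return s[:, None]

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))                           # hypothetical layer weights
X = rng.normal(size=(16, 64))                          # hypothetical calibration activations
Wq = quantize_uniform(W, n_bits=2)
s = product_scaling(W, Wq, X)
err_before = np.linalg.norm(W @ X - Wq @ X)
err_after = np.linalg.norm(W @ X - (s * Wq) @ X)
```

Because the identity scale s = 1 is always a feasible choice, the fitted scale can never increase the calibration error, which is the sense in which output-side scaling "compensates" for quantization; the paper's channel-wise grouping would amortize the cost of storing and applying these scales.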