HMCFormer (hierarchical multi-scale convolutional transformer): a hybrid CNN+Transformer network for intelligent VIA screening
| Main Authors: | , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | PeerJ Inc., 2025-08-01 |
| Series: | PeerJ Computer Science |
| Subjects: | |
| Online Access: | https://peerj.com/articles/cs-3088.pdf |
| Summary: | Cervical cancer ranks first in incidence among malignant tumors of the female reproductive system, and 80% of women who die from cervical cancer worldwide live in developing countries. Visual inspection with acetic acid (VIA) screening supported by artificial intelligence-assisted diagnosis offers a cheap, rapid screening method, which can encourage more low-income women to volunteer for regular cervical cancer screening. However, current AI-based VIA screening studies either have low accuracy or require expensive auxiliary equipment. In this article, we propose the Hierarchical Multi-Scale Convolutional Transformer (HMCFormer), which combines the hierarchical feature extraction capability of Convolutional Neural Networks (CNNs) with the global dependency modeling capability of Transformers to address the challenges of intelligent VIA screening. HMCFormer consists of a Transformer branch and a CNN branch: the Transformer branch receives unenhanced lesion sample images, while the CNN branch receives lesion sample images enhanced by the proposed dual-color-space-based image enhancement algorithm. The authors design a hierarchical multi-scale pixel excitation module for adaptive multi-scale, multi-level local feature extraction, and apply the Swin Transformer architecture with minor modifications for global perception modeling. In addition, the authors propose two feature fusion concepts, adaptive preprocessing and superiority-inferiority fusion, and design a feature fusion module based on them, which significantly improves the collaboration between the Transformer branch and the CNN branch. The authors collected 5,000 samples suitable for VIA screening from public datasets provided by companies such as Intel and Google, forming the PCC5000 dataset. On this dataset, the proposed algorithm achieves a screening accuracy of 97.4% and a grading accuracy of 94.8%. |
|---|---|
| ISSN: | 2376-5992 |
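The summary mentions a dual-color-space-based image enhancement algorithm feeding the CNN branch, but the record does not specify it. As a minimal sketch of the general idea, assuming a simple RGB-to-HSV round trip with a gamma curve on the value channel (the gamma value and the per-pixel formulation are illustrative, not the paper's method):

```python
import colorsys  # stdlib RGB <-> HSV conversion

def enhance_pixel(r, g, b, gamma=0.7):
    """Hypothetical dual-color-space enhancement of one RGB pixel
    (channels in [0, 1]): brighten the HSV value channel with a gamma
    curve while keeping hue and saturation, then convert back to RGB."""
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    v_enh = v ** gamma  # gamma < 1 lifts darker regions
    return colorsys.hsv_to_rgb(h, s, v_enh)

def enhance_image(pixels, gamma=0.7):
    """Apply the per-pixel enhancement to a list of (r, g, b) tuples."""
    return [enhance_pixel(r, g, b, gamma) for (r, g, b) in pixels]
```

Working in HSV lets the brightness adjustment leave hue untouched, so color cues relevant to VIA (acetowhite changes) are not shifted.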