Aligning to the teacher: multilevel feature-aligned knowledge distillation

Bibliographic Details
Main Authors: Yang Zhang, Pan He, Chuanyun Xu, Jingyan Pang, Xiao Wang, Xinghai Yuan, Pengfei Lv, Gang Li
Format: Article
Language: English
Published: PeerJ Inc. 2025-08-01
Series: PeerJ Computer Science
Online Access: https://peerj.com/articles/cs-3075.pdf
Description
Summary: Knowledge distillation is a technique for transferring knowledge from a large teacher model to a small student model. The features of the teacher model usually contain richer information, while those of the student model carry less. This leads to a poor distillation effect during knowledge transfer, because the student model suffers from insufficient feature information and poor generalisation ability. To effectively reduce the feature differences between teacher and student models, we propose a Multilevel Feature Alignment Knowledge Distillation method (MFAKD), which includes a spatial dimension alignment module and a multibranch channel alignment (MBCA) module. An upsampling operation aligns the student feature map with the teacher feature map in the spatial dimensions, allowing the student to learn more comprehensive teacher features. MBCA then aligns the student and teacher feature maps along the channel dimension, transforming the student features into a form more similar to the teacher features. Subsequently, the student model performs inference with the discriminative classifier of the pretrained teacher model. As a result, the student model can exploit these strengths and even surpass the teacher model. The method was validated on the CIFAR-100 dataset and achieves state-of-the-art results; in particular, the classification accuracy of the student model WRN-40-2 exceeds that of the teacher model ResNet-8×4 by almost 2%. The method also demonstrates excellent performance on the Tiny ImageNet dataset.
ISSN: 2376-5992
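
The abstract describes two alignment steps: upsampling the student feature map to the teacher's spatial size, and a multibranch channel alignment (MBCA) that maps student channels into the teacher's channel space. The sketch below is a minimal, hypothetical PyTorch illustration of that idea, not the authors' implementation; the branch count, the 1×1-convolution branches, the averaging of branch outputs, and the MSE feature-matching loss are all assumptions made for the example.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultiBranchChannelAlign(nn.Module):
        """Hypothetical multibranch channel alignment: several 1x1-conv
        branches project student channels to the teacher's channel count;
        their outputs are averaged."""
        def __init__(self, c_student: int, c_teacher: int, num_branches: int = 3):
            super().__init__()
            self.branches = nn.ModuleList(
                [nn.Conv2d(c_student, c_teacher, kernel_size=1)
                 for _ in range(num_branches)]
            )

        def forward(self, f_s: torch.Tensor) -> torch.Tensor:
            return torch.stack([b(f_s) for b in self.branches], dim=0).mean(dim=0)

    def feature_alignment_loss(f_s: torch.Tensor, f_t: torch.Tensor,
                               mbca: MultiBranchChannelAlign) -> torch.Tensor:
        # Spatial alignment: upsample the (smaller) student map to the
        # teacher's spatial size so the maps match position by position.
        f_s = F.interpolate(f_s, size=f_t.shape[-2:],
                            mode="bilinear", align_corners=False)
        # Channel alignment: project student channels into the teacher's
        # channel space via the multibranch module.
        f_s = mbca(f_s)
        # One plausible choice of feature-matching loss.
        return F.mse_loss(f_s, f_t)

    if __name__ == "__main__":
        f_s = torch.randn(2, 128, 4, 4)   # student feature map (assumed shape)
        f_t = torch.randn(2, 256, 8, 8)   # teacher feature map (assumed shape)
        mbca = MultiBranchChannelAlign(c_student=128, c_teacher=256)
        print(feature_alignment_loss(f_s, f_t, mbca).item())

In the paper's framing, the aligned student features would additionally be passed through the frozen classifier of the pretrained teacher at inference time; that step is omitted here for brevity.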