Multiple Hierarchical Cross-Scale Transformer for Remote Sensing Scene Classification

The Transformer model can capture global contextual information but does not have an inherent inductive bias. In contrast, convolutional neural networks (CNNs) are highly praised in computer vision due to their strong inductive bias and local spatial correlation. To combine the advantages of the two...

Bibliographic Details
Main Authors: Dan Zhang, Wenping Ma, Licheng Jiao, Xu Liu, Yuting Yang, Fang Liu
Format: Article
Language: English
Published: MDPI AG 2024-12-01
Series: Remote Sensing
Subjects: hierarchical; multiple cross-scale; Transformer; scene classification
Online Access: https://www.mdpi.com/2072-4292/17/1/42
_version_ 1841548962752888832
author Dan Zhang
Wenping Ma
Licheng Jiao
Xu Liu
Yuting Yang
Fang Liu
author_facet Dan Zhang
Wenping Ma
Licheng Jiao
Xu Liu
Yuting Yang
Fang Liu
author_sort Dan Zhang
collection DOAJ
description The Transformer model can capture global contextual information but lacks an inherent inductive bias. In contrast, convolutional neural networks (CNNs) are highly regarded in computer vision for their strong inductive bias and their ability to model local spatial correlation. To combine the advantages of the two model types, we propose a multiple hierarchical cross-scale Transformer model that efficiently combines the Transformer model with CNNs and is specifically designed for complex remote sensing scene classification. First, a feature pyramid network with attention aggregation extracts multi-scale base features. These base features are then fed into the proposed multi-scale channel Transformer (MSCT) module to derive global features with channel-wise attention. The base features are also fed into the proposed hierarchical cross-scale Transformer (HCST) module, which obtains multi-level cross-scale representations. Finally, the outputs from both modules are used to calculate the final classification score. The effectiveness of the proposed method has been validated on three public datasets: AID, UCM, and NWPU-RESISC45.
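The abstract states only that the outputs of the MSCT and HCST modules are both used for the final classification score; it does not say how. The sketch below illustrates one possible reading, a weighted blend of per-class logits followed by a softmax. The function names `softmax` and `fuse_scores`, the `weight` parameter, and the example logits are all hypothetical and are not taken from the article.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scores(msct_logits, hcst_logits, weight=0.5):
    """Blend per-class logits from two branches, then normalize.

    `weight` is a hypothetical mixing coefficient (assumption, not from
    the article): 1.0 uses only the first branch, 0.0 only the second.
    """
    if len(msct_logits) != len(hcst_logits):
        raise ValueError("both branches must score the same classes")
    blended = [
        weight * a + (1.0 - weight) * b
        for a, b in zip(msct_logits, hcst_logits)
    ]
    return softmax(blended)


# Made-up logits for three scene classes, for illustration only.
scores = fuse_scores([2.0, 0.5, -1.0], [1.0, 1.5, -0.5])
predicted = max(range(len(scores)), key=scores.__getitem__)
```

Here `scores` is a probability distribution over the classes and `predicted` is the index of the highest-scoring class. Other fusion rules, such as concatenating features before a shared classifier, would fit the abstract's wording equally well.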
format Article
id doaj-art-042986599bf744e682eb558d2215fc96
institution Kabale University
issn 2072-4292
language English
publishDate 2024-12-01
publisher MDPI AG
record_format Article
series Remote Sensing
spelling doaj-art-042986599bf744e682eb558d2215fc96
2025-01-10T13:20:02Z
eng
MDPI AG
Remote Sensing
2072-4292
2024-12-01
17
1
42
10.3390/rs17010042
Multiple Hierarchical Cross-Scale Transformer for Remote Sensing Scene Classification
Dan Zhang
Wenping Ma
Licheng Jiao
Xu Liu
Yuting Yang
Fang Liu
Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, Xidian University, Xi’an 710071, China
https://www.mdpi.com/2072-4292/17/1/42
hierarchical
multiple cross-scale
Transformer
scene classification
spellingShingle Dan Zhang
Wenping Ma
Licheng Jiao
Xu Liu
Yuting Yang
Fang Liu
Multiple Hierarchical Cross-Scale Transformer for Remote Sensing Scene Classification
Remote Sensing
hierarchical
multiple cross-scale
Transformer
scene classification
title Multiple Hierarchical Cross-Scale Transformer for Remote Sensing Scene Classification
title_full Multiple Hierarchical Cross-Scale Transformer for Remote Sensing Scene Classification
title_fullStr Multiple Hierarchical Cross-Scale Transformer for Remote Sensing Scene Classification
title_full_unstemmed Multiple Hierarchical Cross-Scale Transformer for Remote Sensing Scene Classification
title_short Multiple Hierarchical Cross-Scale Transformer for Remote Sensing Scene Classification
title_sort multiple hierarchical cross scale transformer for remote sensing scene classification
topic hierarchical
multiple cross-scale
Transformer
scene classification
url https://www.mdpi.com/2072-4292/17/1/42