Incorporating convolutional and transformer architectures to enhance semantic segmentation of fine-resolution urban images

Though convolutional neural networks (CNN) exhibit promise in image semantic segmentation, they have limitations in capturing global context information, resulting in inaccuracies in segmenting small object features and object boundaries. This study introduces a hybrid network, ICTANet, which incorp...

Bibliographic Details
Main Authors: Xizi Yu, Shuang Li, Yu Zhang
Format: Article
Language: English
Published: Taylor & Francis Group 2024-12-01
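Series: European Journal of Remote Sensing
Subjects: Image semantic segmentation; transformers; convolutional neural networks; feature extraction; remote sensing
Online Access: https://www.tandfonline.com/doi/10.1080/22797254.2024.2361768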
_version_ 1846127709398237184
author Xizi Yu
Shuang Li
Yu Zhang
author_sort Xizi Yu
collection DOAJ
description Although convolutional neural networks (CNNs) show promise in image semantic segmentation, they are limited in capturing global context information, which leads to inaccurate segmentation of small objects and object boundaries. This study introduces a hybrid network, ICTANet, which incorporates convolutional and Transformer architectures to improve the segmentation of fine-resolution remote sensing urban imagery. The ICTANet model is essentially a Transformer-based encoder-decoder structure. Its dual-encoder architecture, which combines CNN and Swin Transformer modules, is designed to extract both global information and local detail. Feature Extraction and Fusion (FEF) modules collect the feature information produced at the various stages, enabling multi-scale contextual information fusion. In addition, an Auxiliary Boundary Detection (ABD) module is introduced at the end of the decoder to enhance the model’s ability to capture object boundary information. Ablation experiments were conducted to demonstrate the efficacy of the individual components of the network. The test results show that the proposed model achieves overall accuracies of 91.9% and 92.0% on the ISPRS Vaihingen and Potsdam datasets, respectively. The proposed model is also compared with current state-of-the-art methods and exhibits competitive performance, particularly in the segmentation of small objects such as cars and trees.
format Article
id doaj-art-f5b2996c94f1439ea98beb7f74bc33b0
institution Kabale University
issn 2279-7254
language English
publishDate 2024-12-01
publisher Taylor & Francis Group
record_format Article
series European Journal of Remote Sensing
spelling doaj-art-f5b2996c94f1439ea98beb7f74bc33b0
2024-12-11T11:43:31Z
eng
Taylor & Francis Group
European Journal of Remote Sensing
2279-7254
2024-12-01
vol. 57, no. 1
10.1080/22797254.2024.2361768
Xizi Yu (MOE Key Laboratory of Western China’s Environmental Systems, College of Earth and Environmental Sciences, Lanzhou University, Lanzhou, China)
Shuang Li (Guangzhou Urban Planning & Design Survey Research Institute, Guangzhou, China)
Yu Zhang (Guangzhou Urban Planning & Design Survey Research Institute, Guangzhou, China)
title Incorporating convolutional and transformer architectures to enhance semantic segmentation of fine-resolution urban images
topic Image semantic segmentation
transformers
convolutional neural networks
feature extraction
remote sensing
url https://www.tandfonline.com/doi/10.1080/22797254.2024.2361768