MobileDepth: Monocular Depth Estimation Based on Lightweight Vision Transformer
With the rise of deep learning, monocular depth estimation based on convolutional neural networks (CNNs) has made impressive progress. CNNs excel at extracting local features from a single image; however, they cannot model long-range dependencies, which has a substantial impact on...
Main Authors: | Yundong Li, Xiaokun Wei |
---|---|
Format: | Article |
Language: | English |
Published: | Taylor & Francis Group, 2024-12-01 |
Series: | Applied Artificial Intelligence |
Online Access: | https://www.tandfonline.com/doi/10.1080/08839514.2024.2364159 |
_version_ | 1846119880876621824 |
---|---|
author | Yundong Li; Xiaokun Wei |
author_facet | Yundong Li; Xiaokun Wei |
author_sort | Yundong Li |
collection | DOAJ |
description | With the rise of deep learning, monocular depth estimation based on convolutional neural networks (CNNs) has made impressive progress. CNNs excel at extracting local features from a single image; however, they cannot model long-range dependencies, which substantially limits the performance of monocular depth estimation. In addition, because CNN-based architectures frequently use downsampling operations, many pixel-level features, which are crucial for dense prediction tasks, are lost in the encoder phase. Unlike CNNs, the Vision Transformer (ViT) can capture global feature information, but it requires a large number of parameters and extensive data augmentation owing to its lack of inductive bias. To address these difficulties, we propose a Dilated Self Attention Block (DSAB) and a Local and Global Feature Extraction (LGFE) module. The former addresses the inference speed issue of standard ViT models by limiting the number of self-attention computations among tokens. The latter combines the advantages of CNNs and ViT: it first extracts local representations in a low-dimensional space through standard convolution and then maps the input tensor to a high-dimensional space to capture global information, thereby extracting global and local features simultaneously. |
format | Article |
id | doaj-art-f60bc44f092e4a1791f3d4bd3f754f50 |
institution | Kabale University |
issn | 0883-9514; 1087-6545 |
language | English |
publishDate | 2024-12-01 |
publisher | Taylor & Francis Group |
record_format | Article |
series | Applied Artificial Intelligence |
spelling | doaj-art-f60bc44f092e4a1791f3d4bd3f754f50; 2024-12-16T16:13:02Z; eng; Taylor & Francis Group; Applied Artificial Intelligence; 0883-9514; 1087-6545; 2024-12-01; 38; 1; 10.1080/08839514.2024.2364159; MobileDepth: Monocular Depth Estimation Based on Lightweight Vision Transformer; Yundong Li (0); Xiaokun Wei (1); School of Information Science and Technology, North China University of Technology, Beijing, China; School of Information Science and Technology, North China University of Technology, Beijing, China; https://www.tandfonline.com/doi/10.1080/08839514.2024.2364159 |
spellingShingle | Yundong Li; Xiaokun Wei; MobileDepth: Monocular Depth Estimation Based on Lightweight Vision Transformer; Applied Artificial Intelligence |
title | MobileDepth: Monocular Depth Estimation Based on Lightweight Vision Transformer |
title_sort | mobiledepth monocular depth estimation based on lightweight vision transformer |
url | https://www.tandfonline.com/doi/10.1080/08839514.2024.2364159 |
work_keys_str_mv | AT yundongli mobiledepthmonoculardepthestimationbasedonlightweightvisiontransformer AT xiaokunwei mobiledepthmonoculardepthestimationbasedonlightweightvisiontransformer |
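
The description above says the Dilated Self Attention Block speeds up inference by limiting the number of self-attention computations among tokens, but this record does not say how tokens are selected. The following is a minimal sketch of one common reading of "dilated" attention, not the authors' implementation: tokens are split into groups by index modulo a dilation factor, and each token attends only within its group. The function names, the 1-D token layout, and the use of identity query/key/value projections are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Plain scaled dot-product attention over lists of vectors."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def dilated_self_attention(tokens, dilation):
    """Hypothetical dilated variant: a token attends only to tokens that
    share its index modulo `dilation` (identity Q/K/V projections assumed)."""
    out = [None] * len(tokens)
    for offset in range(dilation):
        idx = list(range(offset, len(tokens), dilation))
        group = [tokens[i] for i in idx]
        for i, vec in zip(idx, attend(group, group, group)):
            out[i] = vec
    return out


def pair_count(n_tokens, dilation=1):
    """Number of query-key pairs scored; dilation=1 is full attention."""
    return sum(len(range(o, n_tokens, dilation)) ** 2 for o in range(dilation))
```

Under these assumptions, 16 tokens need `pair_count(16, 1) == 256` scored pairs with full attention but only `pair_count(16, 4) == 64` with a dilation of 4, which illustrates how restricting attention can cut the cost the description refers to.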