MobileDepth: Monocular Depth Estimation Based on Lightweight Vision Transformer

Bibliographic Details
Main Authors: Yundong Li, Xiaokun Wei
Format: Article
Language: English
Published: Taylor & Francis Group, 2024-12-01
Series: Applied Artificial Intelligence
Online Access: https://www.tandfonline.com/doi/10.1080/08839514.2024.2364159
author Yundong Li
Xiaokun Wei
collection DOAJ
description With the rise of deep learning, monocular depth estimation based on convolutional neural networks (CNNs) has made impressive progress. CNNs excel at extracting local features from a single image, but they cannot model long-range dependencies, which substantially limits the performance of monocular depth estimation. Moreover, because CNN-based architectures frequently use downsampling operations, many pixel-level features, which are crucial for dense prediction tasks, are lost in the encoder phase. Unlike CNNs, the Vision Transformer (ViT) can capture global feature information, but its lack of inductive bias means it requires a large number of parameters and extensive data augmentation. To address these difficulties, this study proposes a Dilated Self-Attention Block (DSAB) and a Local and Global Feature Extraction (LGFE) module. The former alleviates the slow inference of standard ViT models by limiting the number of self-attention computations among tokens. The latter combines the advantages of CNNs and ViT: it first extracts local representations in a low-dimensional space through standard convolution, then maps the input tensor to a high-dimensional space to capture global information, so that global and local features are extracted simultaneously.
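
The DSAB idea, limiting how many token pairs enter the self-attention computation, can be sketched roughly as follows. This is a minimal PyTorch illustration under assumed names and hyperparameters (DilatedSelfAttention, num_heads, dilation), not the paper's exact formulation: each token attends only within an interleaved subset of the sequence, so the attention cost drops by roughly the dilation factor.

import torch
import torch.nn as nn

class DilatedSelfAttention(nn.Module):
    # Hypothetical sketch of dilated self-attention: tokens are split into
    # `dilation` interleaved groups and attention runs within each group,
    # so every token attends to ~n/dilation keys instead of all n.
    def __init__(self, dim: int, num_heads: int = 4, dilation: int = 2):
        super().__init__()
        self.dilation = dilation
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        b, n, d = x.shape
        out = torch.zeros_like(x)
        for offset in range(self.dilation):
            # Select tokens offset, offset+dilation, offset+2*dilation, ...
            idx = torch.arange(offset, n, self.dilation, device=x.device)
            group = x[:, idx, :]
            attended, _ = self.attn(group, group, group)
            out[:, idx, :] = attended
        return out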
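
The LGFE module, as described, follows a MobileViT-style pattern: a standard convolution gathers local features in the low-dimensional space, a pointwise convolution maps the tensor to a higher-dimensional space, and attention over the flattened spatial tokens captures global context. The sketch below is an assumed illustration of that pattern; the class name, channel sizes, and residual fusion are ours, not the authors'.

import torch
import torch.nn as nn

class LocalGlobalFeatureExtraction(nn.Module):
    # Hypothetical MobileViT-style sketch: local conv in low-dim space,
    # pointwise expansion to high-dim space, global attention over the
    # flattened spatial grid, then projection back with a residual.
    def __init__(self, in_ch: int = 64, expand_ch: int = 128, num_heads: int = 4):
        super().__init__()
        self.local = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1)  # local features
        self.expand = nn.Conv2d(in_ch, expand_ch, kernel_size=1)        # to high-dim space
        self.attn = nn.MultiheadAttention(expand_ch, num_heads, batch_first=True)
        self.project = nn.Conv2d(expand_ch, in_ch, kernel_size=1)       # back to low-dim

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.expand(torch.relu(self.local(x)))    # (b, expand_ch, H, W)
        b, c, hh, ww = h.shape
        tokens = h.flatten(2).transpose(1, 2)         # (b, H*W, expand_ch)
        global_feats, _ = self.attn(tokens, tokens, tokens)
        h = global_feats.transpose(1, 2).reshape(b, c, hh, ww)
        return x + self.project(h)                    # fuse local and global paths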
format Article
id doaj-art-f60bc44f092e4a1791f3d4bd3f754f50
institution Kabale University
issn 0883-9514
1087-6545
language English
publishDate 2024-12-01
publisher Taylor & Francis Group
record_format Article
series Applied Artificial Intelligence
spelling Applied Artificial Intelligence, Vol. 38, No. 1 (2024-12-01), doi:10.1080/08839514.2024.2364159. Author affiliations: School of Information Science and Technology, North China University of Technology, Beijing, China (Yundong Li; Xiaokun Wei).
title MobileDepth: Monocular Depth Estimation Based on Lightweight Vision Transformer
url https://www.tandfonline.com/doi/10.1080/08839514.2024.2364159