MobileDepth: Monocular Depth Estimation Based on Lightweight Vision Transformer

With the rise of deep learning, monocular depth estimation based on convolutional neural networks (CNNs) has made impressive progress. CNNs excel at extracting local features from a single image; however, they cannot model long-range dependencies, which substantially limits...

Bibliographic Details
Main Authors:	Yundong Li, Xiaokun Wei
Format:	Article
Language:	English
Published:	Taylor & Francis Group 2024-12-01
Series:	Applied Artificial Intelligence
Online Access:	https://www.tandfonline.com/doi/10.1080/08839514.2024.2364159

_version_	1846119880876621824
author	Yundong Li Xiaokun Wei
author_facet	Yundong Li Xiaokun Wei
author_sort	Yundong Li
collection	DOAJ
description	With the rise of deep learning, monocular depth estimation based on convolutional neural networks (CNNs) has made impressive progress. CNNs excel at extracting local features from a single image; however, they cannot model long-range dependencies, which substantially limits the performance of monocular depth estimation. In addition, because CNN-based architectures frequently use downsampling operations, many pixel-level features, which are crucial for dense prediction tasks, are lost in the encoder stage. Unlike CNNs, the Vision Transformer (ViT) can capture global feature information, but it requires a large number of parameters and data augmentation owing to its lack of inductive bias. To address these difficulties, we propose a Dilated Self Attention Block (DSAB) and a Local and Global Feature Extraction (LGFE) module. The former addresses the inference speed issue of standard ViT models by limiting the number of self-attention computations among tokens. The latter combines the advantages of CNNs and ViT: it first extracts local representations in a low-dimensional space through standard convolution and then maps the input tensor to a high-dimensional space to capture global information, thereby extracting global and local features simultaneously.
format	Article
id	doaj-art-f60bc44f092e4a1791f3d4bd3f754f50
institution	Kabale University
issn	0883-9514 1087-6545
language	English
publishDate	2024-12-01
publisher	Taylor & Francis Group
record_format	Article
series	Applied Artificial Intelligence
spelling	doaj-art-f60bc44f092e4a1791f3d4bd3f754f50; 2024-12-16T16:13:02Z; eng; Taylor & Francis Group; Applied Artificial Intelligence; 0883-9514; 1087-6545; 2024-12-01; 38; 1; 10.1080/08839514.2024.2364159; MobileDepth: Monocular Depth Estimation Based on Lightweight Vision Transformer; Yundong Li 0; Xiaokun Wei 1; School of Information Science and Technology, North China University of Technology, Beijing, China; School of Information Science and Technology, North China University of Technology, Beijing, China; https://www.tandfonline.com/doi/10.1080/08839514.2024.2364159
title	MobileDepth: Monocular Depth Estimation Based on Lightweight Vision Transformer
url	https://www.tandfonline.com/doi/10.1080/08839514.2024.2364159
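
The description above says the DSAB speeds up inference by limiting the number of self-attention computations among tokens. The record does not give the block's actual design, so the following is only a generic illustration of that idea, not the authors' method. It assumes a simple dilation pattern in which each token attends only to tokens whose index shares the same remainder modulo a hypothetical `dilation` parameter, and it uses identity query/key/value projections in place of learned weights.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def dilated_self_attention(tokens, dilation):
    """Toy dilated self-attention (assumed pattern, not the paper's DSAB).

    tokens:   array of shape (n, d), one row per token.
    dilation: hypothetical parameter; token i attends only to tokens j
              with j % dilation == i % dilation.
    Returns the attended tokens and the number of query-key pairs scored.
    """
    n, d = tokens.shape
    out = np.empty_like(tokens)
    pairs = 0
    for offset in range(dilation):
        idx = np.arange(offset, n, dilation)    # tokens in this dilation group
        group = tokens[idx]
        # Identity Q/K/V projections: an assumption to keep the sketch short.
        scores = group @ group.T / np.sqrt(d)
        out[idx] = softmax(scores) @ group
        pairs += len(idx) ** 2
    return out, pairs


rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))                # 16 tokens, 8 channels

full_out, full_pairs = dilated_self_attention(x, dilation=1)
sparse_out, sparse_pairs = dilated_self_attention(x, dilation=4)
print(full_pairs, sparse_pairs)                 # 256 64
```

With `dilation=1` the function reduces to ordinary full self-attention over all 16 tokens (256 scored pairs). With `dilation=4` the tokens split into four groups of four, so only 64 pairs are scored, which is the kind of saving the description attributes to restricting attention among tokens.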

Similar Items