Lightweight Self-Supervised Monocular Depth Estimation Through CNN and Transformer Integration
      
    
| Main Authors: | , , , , |
|---|---|
| Format: | Article | 
| Language: | English | 
| Published: | IEEE 2024-01-01 |
| Series: | IEEE Access | 
| Subjects: | |
| Online Access: | https://ieeexplore.ieee.org/document/10749800/ | 
| Summary: | Self-supervised monocular depth estimation is a promising research area due to its ability to train models without relying on expensive and difficult-to-obtain ground truth depth labels. In this domain, models often employ Convolutional Neural Networks (CNNs) and Transformers for feature extraction. While CNNs excel at capturing local features, they struggle with global information due to their limited receptive field. On the other hand, Transformers can capture global features but are computationally expensive. To balance performance and computational efficiency, this paper proposes a lightweight self-supervised monocular depth estimation model that integrates CNN and Transformer architectures. The model introduces an Agent Attention mechanism to effectively model global context while significantly reducing computational complexity. Furthermore, spatial and channel restructured convolution techniques are utilized to minimize the computational cost associated with redundant feature extraction in visual tasks. Validation on the KITTI dataset shows that the model reaches an Absolute Relative Error of 0.104 and a Squared Relative Error of 0.757 while maintaining a nearly constant number of parameters. The accuracy improved to 0.889, with computational complexity (FLOPs) reduced to 4.993G, and training time decreased from 15.5 hours to 13.5 hours. The model also demonstrated strong generalization on the Make3D dataset, with only 3.0M parameters and low computational complexity, indicating its suitability for resource-constrained devices. |
| ISSN: | 2169-3536 | 
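
The Summary reports an Absolute Relative Error, a Squared Relative Error, and an accuracy figure. As a rough illustration of how such numbers are commonly computed in depth-estimation benchmarks, the sketch below uses the standard definitions. It assumes the reported accuracy is the usual threshold accuracy (the fraction of pixels with max(pred/gt, gt/pred) < 1.25), which the record does not state; the function name `depth_metrics` and the sample values are hypothetical and are not taken from the article.

```python
def depth_metrics(pred, gt, threshold=1.25):
    """Return (abs_rel, sq_rel, delta_acc) for paired positive depth values.

    Assumes invalid pixels (e.g. gt == 0) have already been masked out.
    The 1.25 threshold is the conventional choice, not one confirmed by the record.
    """
    if len(pred) != len(gt) or not gt:
        raise ValueError("pred and gt must be non-empty and equally long")
    n = len(gt)
    # Absolute Relative Error: mean of |pred - gt| / gt
    abs_rel = sum(abs(p - g) / g for p, g in zip(pred, gt)) / n
    # Squared Relative Error: mean of (pred - gt)^2 / gt
    sq_rel = sum((p - g) ** 2 / g for p, g in zip(pred, gt)) / n
    # Threshold accuracy: share of pixels whose depth ratio is below the threshold
    delta_acc = sum(max(p / g, g / p) < threshold for p, g in zip(pred, gt)) / n
    return abs_rel, sq_rel, delta_acc


# Toy values in metres, purely illustrative
pred = [2.0, 4.0, 10.0]
gt = [2.0, 5.0, 8.0]
abs_rel, sq_rel, acc = depth_metrics(pred, gt)
print(f"AbsRel={abs_rel:.3f} SqRel={sq_rel:.3f} acc={acc:.3f}")
```

Lower is better for the two error terms and higher is better for the threshold accuracy, which matches the direction of the improvements described in the Summary.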
 
       