Mini-InternVL: a flexible-transfer pocket multi-modal model with 5% parameters and 90% performance

Abstract: Multi-modal large language models (MLLMs) have demonstrated impressive performance in vision-language tasks across a wide range of domains. However, their large model scale and the associated high computational cost pose significant challenges for training and deploying MLLMs on consumer-grade GPUs or edge devices, hindering their widespread application. In this work, we introduce Mini-InternVL, a series of MLLMs with parameters ranging from 1 billion to 4 billion, which achieves 90% of the performance with only 5% of the parameters. This significant improvement in efficiency and effectiveness makes our models more accessible and applicable in various real-world scenarios. To further promote the adoption of our models, we develop a unified adaptation framework for Mini-InternVL, which enables our models to transfer to downstream tasks and outperform specialized models in domains including autonomous driving, medical image processing, and remote sensing. We believe that our models can provide valuable insights and resources to advance the development of efficient and effective MLLMs.

Bibliographic Details
Main Authors: Zhangwei Gao, Zhe Chen, Erfei Cui, Yiming Ren, Weiyun Wang, Jinguo Zhu, Hao Tian, Shenglong Ye, Junjun He, Xizhou Zhu, Lewei Lu, Tong Lu, Yu Qiao, Jifeng Dai, Wenhai Wang
Format: Article
Language: English
Published: Springer, 2024-12-01
Series: Visual Intelligence
ISSN: 2731-9008
Subjects: Lightweight multi-modal large language model; Vision-language model; Knowledge distillation; Visual instruction tuning
Online Access: https://doi.org/10.1007/s44267-024-00067-6
Author affiliations:
Zhangwei Gao, Zhe Chen, Erfei Cui, Yiming Ren, Weiyun Wang, Jinguo Zhu, Shenglong Ye, Junjun He, Yu Qiao: Shanghai AI Laboratory
Hao Tian, Lewei Lu: SenseTime Research
Xizhou Zhu, Jifeng Dai: Department of Electronic Engineering, Tsinghua University
Tong Lu: School of Computer Science, Nanjing University
Wenhai Wang: Department of Information Engineering, The Chinese University of Hong Kong