Mini-InternVL: a flexible-transfer pocket multi-modal model with 5% parameters and 90% performance
Abstract Multi-modal large language models (MLLMs) have demonstrated impressive performance in vision-language tasks across a wide range of domains. However, the large model scale and associated high computational cost pose significant challenges for training and deploying MLLMs on consumer-grade GPUs or edge devices, thereby hindering their widespread application. In this work, we introduce Mini-InternVL, a series of MLLMs with parameters ranging from 1 billion to 4 billion, which achieves 90% of the performance with only 5% of the parameters. This significant improvement in efficiency and effectiveness makes our models more accessible and applicable in various real-world scenarios. To further promote the adoption of our models, we are developing a unified adaptation framework for Mini-InternVL, which enables our models to transfer and outperform specialized models in downstream tasks, including autonomous driving, medical image processing, and remote sensing. We believe that our models can provide valuable insights and resources to advance the development of efficient and effective MLLMs.
| Main Authors: | Zhangwei Gao, Zhe Chen, Erfei Cui, Yiming Ren, Weiyun Wang, Jinguo Zhu, Hao Tian, Shenglong Ye, Junjun He, Xizhou Zhu, Lewei Lu, Tong Lu, Yu Qiao, Jifeng Dai, Wenhai Wang |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Springer, 2024-12-01 |
| Series: | Visual Intelligence |
| Subjects: | Lightweight multi-modal large language model; Vision-language model; Knowledge distillation; Visual instruction tuning |
| Online Access: | https://doi.org/10.1007/s44267-024-00067-6 |
| Field | Value |
|---|---|
| author | Zhangwei Gao; Zhe Chen; Erfei Cui; Yiming Ren; Weiyun Wang; Jinguo Zhu; Hao Tian; Shenglong Ye; Junjun He; Xizhou Zhu; Lewei Lu; Tong Lu; Yu Qiao; Jifeng Dai; Wenhai Wang |
| author_sort | Zhangwei Gao |
| collection | DOAJ |
| description | Abstract Multi-modal large language models (MLLMs) have demonstrated impressive performance in vision-language tasks across a wide range of domains. However, the large model scale and associated high computational cost pose significant challenges for training and deploying MLLMs on consumer-grade GPUs or edge devices, thereby hindering their widespread application. In this work, we introduce Mini-InternVL, a series of MLLMs with parameters ranging from 1 billion to 4 billion, which achieves 90% of the performance with only 5% of the parameters. This significant improvement in efficiency and effectiveness makes our models more accessible and applicable in various real-world scenarios. To further promote the adoption of our models, we are developing a unified adaptation framework for Mini-InternVL, which enables our models to transfer and outperform specialized models in downstream tasks, including autonomous driving, medical image processing, and remote sensing. We believe that our models can provide valuable insights and resources to advance the development of efficient and effective MLLMs. |
| format | Article |
| id | doaj-art-fb58e6071c0d4a7998ba8b9a8f4f8a61 |
| institution | Kabale University |
| issn | 2731-9008 |
| language | English |
| publishDate | 2024-12-01 |
| publisher | Springer |
| record_format | Article |
| series | Visual Intelligence |
| spelling | Visual Intelligence 2(1): 1–17, Springer, 2024-12-01; https://doi.org/10.1007/s44267-024-00067-6. Author affiliations: Zhangwei Gao, Zhe Chen, Erfei Cui, Yiming Ren, Weiyun Wang, Jinguo Zhu, Shenglong Ye, Junjun He, Yu Qiao (Shanghai AI Laboratory); Hao Tian, Lewei Lu (SenseTime Research); Xizhou Zhu, Jifeng Dai (Department of Electronic Engineering, Tsinghua University); Tong Lu (School of Computer Science, Nanjing University); Wenhai Wang (Department of Information Engineering, The Chinese University of Hong Kong) |
| title | Mini-InternVL: a flexible-transfer pocket multi-modal model with 5% parameters and 90% performance |
| topic | Lightweight multi-modal large language model; Vision-language model; Knowledge distillation; Visual instruction tuning |
| url | https://doi.org/10.1007/s44267-024-00067-6 |