| --- |
| license: cc-by-nc-4.0 |
| library_name: transformers |
| pipeline_tag: text-generation |
| tags: |
| - VILA |
| - VLM |
| --- |
| |
| # VILA Model Card |
|
|
| ## Model details |
|
|
| **Model type:** |
VILA is a visual language model (VLM) pretrained with interleaved image-text data at scale, enabling multi-image VLM capabilities. VILA is deployable on the edge, including Jetson Orin and laptops, via AWQ 4-bit quantization through the TinyChat framework. We find that: (1) image-text pairs are not enough; interleaved image-text data is essential; (2) unfreezing the LLM during interleaved image-text pre-training enables in-context learning; (3) re-blending text-only instruction data is crucial to boost both VLM and text-only performance. VILA unveils appealing capabilities, including multi-image reasoning, in-context learning, visual chain-of-thought, and better world knowledge.
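
The snippet below is a minimal sketch of loading the released checkpoint for text generation with the `transformers` auto classes, consistent with the `library_name` and `pipeline_tag` in this card. The repository id, dtype, and generation settings are illustrative assumptions; image preprocessing and multi-image prompting go through the project's own inference code in the GitHub repository, which remains the authoritative path.

```python
# Hedged sketch: assumes this checkpoint can be loaded with the standard
# transformers auto classes (trust_remote_code may be required).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Efficient-Large-Model/VILA-7b"  # hypothetical repo id for illustration

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to fit on a single GPU
    device_map="auto",
    trust_remote_code=True,
)

# Text-only prompt shown here; multi-image inputs require the project's
# own image preprocessing utilities.
prompt = "Describe what makes interleaved image-text pre-training useful."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```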
|
|
| **Model date:** |
| VILA-7b was trained in Feb 2024. |
|
|
| **Paper or resources for more information:** |
| https://github.com/Efficient-Large-Model/VILA |
|
|
| ``` |
| @misc{lin2023vila, |
| title={VILA: On Pre-training for Visual Language Models}, |
| author={Ji Lin and Hongxu Yin and Wei Ping and Yao Lu and Pavlo Molchanov and Andrew Tao and Huizi Mao and Jan Kautz and Mohammad Shoeybi and Song Han}, |
| year={2023}, |
| eprint={2312.07533}, |
| archivePrefix={arXiv}, |
| primaryClass={cs.CV} |
| } |
| ``` |
|
|
| ## License |
| - The code is released under the Apache 2.0 license as found in the [LICENSE](./LICENSE) file. |
| - The pretrained weights are released under the [CC-BY-NC-SA-4.0 license](https://creativecommons.org/licenses/by-nc-sa/4.0/deed.en). |
| - The service is a research preview intended for non-commercial use only, and is subject to the following licenses and terms: |
| - [Model License](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md) of LLaMA |
| - [Terms of Use](https://openai.com/policies/terms-of-use) of the data generated by OpenAI |
| - [Dataset Licenses](https://github.com/Efficient-Large-Model/VILA/blob/main/data_prepare/LICENSE) for each dataset used during training. |
|
|
| **Where to send questions or comments about the model:** |
| https://github.com/Efficient-Large-Model/VILA/issues |
|
|
| ## Intended use |
| **Primary intended uses:** |
| The primary use of VILA is research on large multimodal models and chatbots. |
|
|
| **Primary intended users:** |
| The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence. |
|
|
| ## Training dataset |
| See [Dataset Preparation](https://github.com/Efficient-Large-Model/VILA/blob/main/data_prepare/README.md) for more details. |
|
|
| ## Evaluation dataset |
| A collection of 12 benchmarks: 5 academic VQA benchmarks and 7 recent benchmarks specifically proposed for instruction-following LMMs. |