NVlabs / VILA
VILA is a family of state-of-the-art vision language models (VLMs) for diverse multimodal AI tasks across the edge, data center, and cloud.
☆3,201 · Updated last week
Alternatives and similar repositories for VILA:
Users interested in VILA are comparing it to the libraries listed below.
- Cambrian-1 is a family of multimodal LLMs with a vision-centric design. ☆1,898 · Updated 6 months ago
- Next-Token Prediction is All You Need ☆2,106 · Updated last month
- Repository for Meta Chameleon, a mixed-modal early-fusion foundation model from FAIR. ☆1,989 · Updated 9 months ago
- InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions ☆2,820 · Updated last week
- Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks ☆2,304 · Updated this week
- LLaVA-CoT, a visual language model capable of spontaneous, systematic reasoning ☆1,973 · Updated 3 weeks ago
- Mixture-of-Experts for Large Vision-Language Models ☆2,152 · Updated 5 months ago
- ICLR 2024 Spotlight: curation/training code, metadata, distribution and pre-trained models for MetaCLIP; CVPR 2024: MoDE: CLIP Data Expert… ☆1,428 · Updated last month
- A suite of image and video neural tokenizers ☆1,622 · Updated 2 months ago
- GPT4V-level open-source multi-modal model based on Llama3-8B ☆2,341 · Updated 2 months ago
- 4M: Massively Multimodal Masked Modeling ☆1,717 · Updated last month
- A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings. ☆900 · Updated last month
- Witness the aha moment of VLM with less than $3. ☆3,622 · Updated 2 months ago
- This repository provides the code and model checkpoints for AIMv1 and AIMv2 research projects. ☆1,275 · Updated last week
- PyTorch code and models for V-JEPA self-supervised learning from video. ☆2,968 · Updated 2 months ago
- ✨✨VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction ☆2,256 · Updated last month
- VideoSys: An easy and efficient system for video generation ☆1,959 · Updated last month
- Famous Vision Language Models and Their Architectures ☆803 · Updated 2 months ago
- Solve Visual Understanding with Reinforced VLMs ☆4,860 · Updated 2 weeks ago
- Implementation for Describe Anything: Detailed Localized Image and Video Captioning ☆908 · Updated this week
- Official repository for "AM-RADIO: Reduce All Domains Into One" ☆1,133 · Updated last week
- 🔥🔥🔥 Latest Papers, Codes and Datasets on Vid-LLMs. ☆2,236 · Updated 3 months ago
- VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs ☆1,152 · Updated 3 months ago
- 【EMNLP 2024🔥】Video-LLaVA: Learning United Visual Representation by Alignment Before Projection ☆3,238 · Updated 5 months ago
- Strong and Open Vision Language Assistant for Mobile Devices ☆1,206 · Updated last year
- DeepSeek-VL: Towards Real-World Vision-Language Understanding ☆3,806 · Updated last year
- SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformer ☆4,082 · Updated 2 weeks ago
- Grounded SAM 2: Ground and Track Anything in Videos with Grounding DINO, Florence-2 and SAM 2 ☆2,066 · Updated 2 weeks ago
- Official codebase used to develop Vision Transformer, SigLIP, MLP-Mixer, LiT and more. ☆2,837 · Updated last month