NVlabs / VILA
VILA - a multi-image visual language model with training, inference and evaluation recipe, deployable from cloud to edge (Jetson Orin and laptops)
☆1,968 · Updated last week
Related projects
Alternatives and complementary repositories for VILA
- Cambrian-1 is a family of multimodal LLMs with a vision-centric design. ☆1,749 · Updated last week
- Mixture-of-Experts for Large Vision-Language Models ☆1,971 · Updated 5 months ago
- Repository for Meta Chameleon, a mixed-modal early-fusion foundation model from FAIR. ☆1,823 · Updated 3 months ago
- DeepSeek-VL: Towards Real-World Vision-Language Understanding ☆2,064 · Updated 6 months ago
- Next-Token Prediction is All You Need ☆1,786 · Updated 2 weeks ago
- Qwen2-VL is the multimodal large language model series developed by the Qwen team, Alibaba Cloud. ☆2,983 · Updated last month
- 4M: Massively Multimodal Masked Modeling ☆1,600 · Updated last month
- [EMNLP 2024] Video-LLaVA: Learning United Visual Representation by Alignment Before Projection ☆2,966 · Updated last month
- Official repo for "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models" ☆3,206 · Updated 6 months ago
- LLaVA++: Extending LLaVA with Phi-3 and LLaMA-3 (LLaVA LLaMA-3, LLaVA Phi-3) ☆807 · Updated 3 months ago
- Reaching LLaMA2 Performance with 0.1M Dollars ☆960 · Updated 3 months ago
- Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation ☆913 · Updated last week
- GPT4V-level open-source multi-modal model based on Llama3-8B ☆2,100 · Updated 2 months ago
- Open-source evaluation toolkit for large vision-language models (LVLMs), supporting 160+ VLMs and 50+ benchmarks ☆1,294 · Updated this week
- A family of lightweight multimodal models. ☆928 · Updated 2 weeks ago
- VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs ☆847 · Updated this week
- MiniSora: A community project exploring the implementation path and future development direction of Sora. ☆1,214 · Updated last month
- Mora: More like Sora for Generalist Video Generation ☆1,513 · Updated 3 weeks ago
- Strong and Open Vision Language Assistant for Mobile Devices ☆1,032 · Updated 6 months ago
- PyTorch code and models for V-JEPA self-supervised learning from video. ☆2,664 · Updated 3 months ago
- A native PyTorch library for large model training ☆2,566 · Updated this week
- Autoregressive Model Beats Diffusion: 🦙 Llama for Scalable Image Generation ☆1,297 · Updated 2 months ago
- VideoSys: An easy and efficient system for video generation ☆1,761 · Updated this week
- LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills ☆703 · Updated 9 months ago
- Codebase for Aria - an Open Multimodal Native MoE ☆779 · Updated this week
- InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output ☆2,509 · Updated 3 weeks ago
- API for Grounding DINO 1.5: IDEA Research's Most Capable Open-World Object Detection Model Series ☆770 · Updated 3 months ago
- ICLR 2024 Spotlight: curation/training code, metadata, distribution and pre-trained models for MetaCLIP; CVPR 2024: MoDE: CLIP Data Expert… ☆1,245 · Updated 2 weeks ago