VITA-MLLM / VITA
✨✨VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
☆1,932 · Updated this week
Alternatives and similar repositories for VITA:
Users interested in VITA are also comparing it to the repositories listed below.
- The official repo of Qwen2-Audio chat & pretrained large audio language model proposed by Alibaba Cloud. ☆1,395 · Updated 5 months ago
- GPT4V-level open-source multi-modal model based on Llama3-8B ☆2,204 · Updated 4 months ago
- Next-Token Prediction is All You Need ☆1,965 · Updated 2 months ago
- Open-source multimodal large language model that can hear and talk while thinking. Featuring real-time end-to-end speech input and streaming… ☆3,066 · Updated 2 months ago
- Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities. ☆1,550 · Updated this week
- InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions ☆2,709 · Updated 3 weeks ago
- A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings. ☆595 · Updated last month
- DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding ☆797 · Updated this week
- VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs ☆1,013 · Updated last week
- Qwen2-VL is the multimodal large language model series developed by the Qwen team, Alibaba Cloud. ☆4,207 · Updated this week
- The official repo of Qwen-Audio (通义千问-Audio) chat & pretrained large audio language model proposed by Alibaba Cloud. ☆1,553 · Updated 6 months ago
- Official code for the Goldfish model for long video understanding and MiniGPT4-video for short video understanding ☆581 · Updated last month
- Code for "AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling" ☆818 · Updated 4 months ago
- Janus-Series: Unified Multimodal Understanding and Generation Models ☆1,327 · Updated 2 months ago
- 🔥🔥 LLaVA++: Extending LLaVA with Phi-3 and LLaMA-3 (LLaVA LLaMA-3, LLaVA Phi-3) ☆822 · Updated 6 months ago
- GLM-4-Voice | End-to-end Chinese-English speech dialogue model ☆2,565 · Updated last month
- Open-source evaluation toolkit for large vision-language models (LVLMs), supporting 160+ VLMs and 50+ benchmarks ☆1,689 · Updated this week
- An open-sourced end-to-end VLM-based GUI Agent ☆513 · Updated last week
- Open-source, end-to-end Vision-Language-Action model for GUI Agent & Computer Use. ☆841 · Updated this week
- ✨✨Freeze-Omni: A Smart and Low-Latency Speech-to-Speech Dialogue Model with a Frozen LLM ☆255 · Updated 2 weeks ago
- Cambrian-1 is a family of multimodal LLMs with a vision-centric design. ☆1,823 · Updated 2 months ago
- An efficient, flexible, and full-featured toolkit for fine-tuning LLMs (InternLM2, Llama3, Phi3, Qwen, Mistral, ...) ☆4,158 · Updated this week
- mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding ☆2,015 · Updated 3 weeks ago
- SEED-Story: Multimodal Long Story Generation with Large Language Model ☆783 · Updated 3 months ago
- An Open Large Reasoning Model for Real-World Solutions ☆1,378 · Updated last month
- VideoSys: An easy and efficient system for video generation ☆1,875 · Updated 2 weeks ago
- LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve spee… ☆2,746 · Updated 2 months ago