VITA-MLLM / VITA
✨✨VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
☆2,167 · Updated last month
Alternatives and similar repositories for VITA:
Users interested in VITA are comparing it to the repositories listed below.
- The official repo of Qwen2-Audio chat & pretrained large audio language model proposed by Alibaba Cloud. ☆1,618 · Updated 7 months ago
- ☆743 · Updated this week
- Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities ☆1,688 · Updated 2 months ago
- Open-source multimodal large language model that can hear, talk while thinking. Featuring real-time end-to-end speech input and streaming… ☆3,226 · Updated 4 months ago
- GPT4V-level open-source multi-modal model based on Llama3-8B ☆2,315 · Updated 3 weeks ago
- Next-Token Prediction is All You Need ☆2,042 · Updated last week
- EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework based on veRL ☆1,681 · Updated this week
- A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings. ☆838 · Updated last month
- GLM-4-Voice | End-to-end Chinese-English spoken dialogue model ☆2,787 · Updated 3 months ago
- VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs ☆1,117 · Updated 2 months ago
- Frontier Multimodal Foundation Models for Image and Video Understanding ☆664 · Updated this week
- 🔥🔥 LLaVA++: Extending LLaVA with Phi-3 and LLaMA-3 (LLaVA LLaMA-3, LLaVA Phi-3) ☆835 · Updated 8 months ago
- ✨✨Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM ☆294 · Updated 2 months ago
- InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions ☆2,791 · Updated 2 months ago
- The official repo of Qwen-Audio (通义千问-Audio) chat & pretrained large audio language model proposed by Alibaba Cloud. ☆1,637 · Updated 8 months ago
- ☆3,591 · Updated last month
- Official repository of 'Visual-RFT: Visual Reinforcement Fine-Tuning' ☆1,384 · Updated this week
- ☆1,467 · Updated 3 months ago
- Align Anything: Training All-modality Model with Feedback ☆2,967 · Updated this week
- Official code for the Goldfish model (long video understanding) and MiniGPT4-video (short video understanding) ☆603 · Updated 3 months ago
- [CVPR 2025] Open-source, End-to-end, Vision-Language-Action model for GUI Agent & Computer Use. ☆1,117 · Updated last week
- An open-sourced end-to-end VLM-based GUI Agent ☆837 · Updated last month
- Witness the aha moment of VLM with less than $3. ☆3,376 · Updated 3 weeks ago
- Codebase for Aria - an Open Multimodal Native MoE ☆1,025 · Updated 2 months ago
- Open-source evaluation toolkit of large multi-modality models (LMMs), support 220+ LMMs, 80+ benchmarks ☆2,063 · Updated this week
- A fork to add multimodal model training to open-r1 ☆1,108 · Updated last month
- Code for "AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling" ☆832 · Updated 6 months ago
- Parsing-free RAG supported by VLMs ☆636 · Updated last month
- A family of lightweight multimodal models. ☆1,006 · Updated 4 months ago
- ✨✨[CVPR 2025] Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis ☆489 · Updated this week