Ola-Omni / Ola
Ola: Pushing the Frontiers of Omni-Modal Language Model
☆385 · Updated Jun 13, 2025
Alternatives and similar repositories for Ola
Users interested in Ola are comparing it to the repositories listed below.
- [ICLR 2025] MLLM for On-Demand Spatial-Temporal Understanding at Arbitrary Resolution ☆331 · Updated Jul 4, 2025
- ✨✨ [NeurIPS 2025] VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction ☆2,487 · Updated Mar 28, 2025
- [CVPR 2025 Highlight] Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models ☆233 · Updated Nov 7, 2025
- Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos ☆66 · Updated Sep 5, 2025
- DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception ☆159 · Updated Dec 6, 2024
- Kimi-VL: Mixture-of-Experts Vision-Language Model for Multimodal Reasoning, Long-Context Understanding, and Strong Agent Capabilities ☆1,156 · Updated Jul 15, 2025
- Your faithful, impartial partner for audio evaluation: know yourself, know your rivals. ☆275 · Updated Feb 3, 2026
- ✨✨ Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy ☆305 · Updated May 14, 2025
- ☆185 · Updated Feb 8, 2025
- Streamable Text-to-Speech model using a language modeling approach, without vector quantization ☆110 · Updated May 20, 2025
- Baichuan-Omni: Towards Capable Open-source Omni-modal LLM 🌊 ☆272 · Updated Jan 27, 2025
- [CVPR 2025] PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models ☆51 · Updated Jun 12, 2025
- Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning, achieving stat… ☆1,544 · Updated Jun 14, 2025
- [ICCV 2025] Official repo for "GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation" ☆198 · Updated Jan 7, 2026
- A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings ☆1,430 · Updated this week
- A Simple Framework of Small-scale LMMs for Video Understanding ☆108 · Updated Jun 11, 2025
- Official repo of the ICLR 2025 paper "MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos" ☆28 · Updated Jul 15, 2025
- Align Anything: Training All-modality Model with Feedback ☆4,632 · Updated Nov 27, 2025
- VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs ☆1,277 · Updated Jan 23, 2025
- [ICLR & NeurIPS 2025] Repository for the Show-o series: One Single Transformer to Unify Multimodal Understanding and Generation ☆1,876 · Updated Jan 8, 2026
- Qwen2.5-Omni is an end-to-end multimodal model by the Qwen team at Alibaba Cloud, capable of understanding text, audio, vision, video, and pe… ☆3,919 · Updated Jun 12, 2025
- WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs ☆38 · Updated Jan 26, 2026
- Frontier Multimodal Foundation Models for Image and Video Understanding ☆1,102 · Updated Aug 14, 2025
- ✨✨ Freeze-Omni: A Smart and Low-Latency Speech-to-Speech Dialogue Model with a Frozen LLM ☆365 · Updated May 27, 2025
- Valley is a cutting-edge multimodal large model designed to handle a variety of tasks involving text, image, and video data ☆271 · Updated Jan 20, 2026
- Implementation for "The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer" ☆78 · Updated Oct 29, 2025
- Evaluation code for the paper "AV-Odyssey: Can Your Multimodal LLMs Really Understand Audio-Visual Information?" ☆31 · Updated Dec 23, 2024
- A fork of open-r1 that adds multimodal model training ☆1,474 · Updated Feb 8, 2025
- Next-Token Prediction is All You Need ☆2,345 · Updated Jan 12, 2026
- Baichuan-Audio: A Unified Framework for End-to-End Speech Interaction ☆217 · Updated Feb 28, 2025
- Solve Visual Understanding with Reinforced VLMs ☆5,841 · Updated Oct 21, 2025
- [ICCV 2025 Highlight] Official repository for "2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining" ☆191 · Updated Mar 17, 2025
- Official implementation of the BLIP3o series ☆1,638 · Updated Nov 29, 2025
- Official implementation of ICCV 2025 "Flash-VStream: Efficient Real-Time Understanding for Long Video Streams" ☆269 · Updated Oct 15, 2025
- Open-source evaluation toolkit for large multi-modality models (LMMs); supports 220+ LMMs and 80+ benchmarks ☆3,816 · Updated this week
- [CVPR 2025 Highlight] RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness ☆444 · Updated May 14, 2025
- Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities ☆1,869 · Updated Jan 16, 2025
- VideoNIAH: A Flexible Synthetic Method for Benchmarking Video MLLMs ☆54 · Updated Mar 9, 2025
- EVE Series: Encoder-Free Vision-Language Models from BAAI ☆368 · Updated Jul 24, 2025