Ola: Pushing the Frontiers of Omni-Modal Language Model
☆389Jun 13, 2025Updated 9 months ago
Alternatives and similar repositories for Ola
Users that are interested in Ola are comparing it to the libraries listed below.
- [ICLR 2025] MLLM for On-Demand Spatial-Temporal Understanding at Arbitrary Resolution☆330Jul 4, 2025Updated 8 months ago
- ✨✨[NeurIPS 2025] VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction☆2,500Mar 28, 2025Updated last year
- Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos☆69Sep 5, 2025Updated 6 months ago
- [CVPR2025 Highlight] Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models☆239Nov 7, 2025Updated 4 months ago
- ☆186Feb 8, 2025Updated last year
- Baichuan-Omni: Towards Capable Open-source Omni-modal LLM 🌊☆273Jan 27, 2025Updated last year
- Align Anything: Training All-modality Model with Feedback☆4,638Nov 27, 2025Updated 4 months ago
- DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception☆159Dec 6, 2024Updated last year
- Streamable Text-to-Speech model using a language modeling approach, without vector quantization☆110May 20, 2025Updated 10 months ago
- Kimi-VL: Mixture-of-Experts Vision-Language Model for Multimodal Reasoning, Long-Context Understanding, and Strong Agent Capabilities☆1,171Jul 15, 2025Updated 8 months ago
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-language Models☆99Mar 22, 2024Updated 2 years ago
- Your faithful, impartial partner for audio evaluation: know yourself, know your rivals. Authentic evaluation; know yourself and know your rivals.☆282Mar 19, 2026Updated 2 weeks ago
- Implementation for "The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer"☆80Oct 29, 2025Updated 5 months ago
- Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning, achieving stat…☆1,566Jun 14, 2025Updated 9 months ago
- ✨✨Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy☆306May 14, 2025Updated 10 months ago
- A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings.☆1,441Feb 11, 2026Updated last month
- [ECCV 2024] Efficient Inference of Vision Instruction-Following Models with Elastic Cache☆43Jul 26, 2024Updated last year
- [ICLR & NeurIPS 2025] Repository for Show-o series, One Single Transformer to Unify Multimodal Understanding and Generation.☆1,904Jan 8, 2026Updated 2 months ago
- Qwen2.5-Omni is an end-to-end multimodal model by Qwen team at Alibaba Cloud, capable of understanding text, audio, vision, video, and pe…☆3,966Jun 12, 2025Updated 9 months ago
- ☆22Feb 13, 2026Updated last month
- [ICCV 2025] Official repo for "GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation"☆202Jan 7, 2026Updated 2 months ago
- Baichuan-Audio: A Unified Framework for End-to-End Speech Interaction☆219Feb 28, 2025Updated last year
- This repo contains evaluation code for the paper "AV-Odyssey: Can Your Multimodal LLMs Really Understand Audio-Visual Information?"☆31Dec 23, 2024Updated last year
- A fork to add multimodal model training to open-r1☆1,514Feb 8, 2025Updated last year
- A Simple Framework of Small-scale LMMs for Video Understanding☆111Jun 11, 2025Updated 9 months ago
- VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs☆1,287Jan 23, 2025Updated last year
- Valley is a cutting-edge multimodal large model designed to handle a variety of tasks involving text, images, and video data.☆277Jan 20, 2026Updated 2 months ago
- [CVPR 2025] PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models☆52Jun 12, 2025Updated 9 months ago
- Next-Token Prediction is All You Need☆2,381Jan 12, 2026Updated 2 months ago
- ☆38Apr 3, 2025Updated 11 months ago
- Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities.☆1,876Jan 16, 2025Updated last year
- [arXiv: 2502.05178] QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation☆96Mar 1, 2025Updated last year
- ☆4,615Sep 14, 2025Updated 6 months ago
- Code associated with the paper: CTC-DRO: Robust Optimization for Reducing Language Disparities in Speech Recognition.☆17May 16, 2025Updated 10 months ago
- Frontier Multimodal Foundation Models for Image and Video Understanding☆1,131Aug 14, 2025Updated 7 months ago
- [ICML 2025] Streamline Without Sacrifice - Squeeze out Computation Redundancy in LMM☆20May 22, 2025Updated 10 months ago
- Open-source evaluation toolkit of large multi-modality models (LMMs), support 220+ LMMs, 80+ benchmarks☆3,958Mar 25, 2026Updated last week
- [CVPR 2025] EgoLife: Towards Egocentric Life Assistant☆406Mar 19, 2025Updated last year
- This is the official implementation of ICCV 2025 "Flash-VStream: Efficient Real-Time Understanding for Long Video Streams"☆274Oct 15, 2025Updated 5 months ago