facebookresearch / MILS
Code release for "LLMs can see and hear without any training"
☆231 · Updated last month
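For context, the MILS paper pairs a text-only LLM with a pretrained multimodal scorer in a gradient-free feedback loop: the LLM proposes candidates, the scorer ranks them against the input, and the top-scoring pairs are fed back as context for the next round. Below is a minimal sketch of that loop, assuming Hugging Face CLIP as the scorer; `propose_captions` and `mils_loop` are hypothetical names, not the repo's actual API.

```python
# Minimal sketch of the propose / score / feed-back loop from
# "LLMs can see and hear without any training" (MILS). Assumptions:
# CLIP from Hugging Face transformers as the scorer; propose_captions
# stands in for any text-only LLM call. Names here are hypothetical
# and do not mirror the MILS repo's actual API.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_scores(image: Image.Image, captions: list[str]) -> list[float]:
    # Image-text similarity logits for each candidate caption.
    inputs = processor(text=captions, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    return out.logits_per_image[0].tolist()

def mils_loop(image, propose_captions, rounds=5, k=8):
    # propose_captions(feedback) -> list[str]: any off-the-shelf LLM
    # prompted with the best (caption, score) pairs so far. No gradients
    # flow anywhere, so neither model is trained.
    feedback = []
    for _ in range(rounds):
        candidates = propose_captions(feedback)[:k]
        scores = clip_scores(image, candidates)
        feedback = sorted(zip(candidates, scores), key=lambda p: -p[1])
    return feedback[0][0]  # highest-scoring caption found
```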
Alternatives and similar repositories for MILS:
Users interested in MILS are comparing it to the repositories listed below
- Rethinking Step-by-step Visual Reasoning in LLMs ☆282 · Updated 2 months ago
- Official GPU implementation of the paper "PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance" ☆126 · Updated 4 months ago
- An open source implementation of CLIP (with TULIP support) ☆113 · Updated last week
- Official PyTorch implementation of TokenSet ☆104 · Updated last week
- Long Context Transfer from Language to Vision ☆368 · Updated 2 weeks ago
- Kyutai with an "eye" ☆160 · Updated this week
- LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture ☆200 · Updated 2 months ago
- [CVPR 2024] VCoder: Versatile Vision Encoders for Multimodal Large Language Models ☆276 · Updated 11 months ago
- HPT - Open Multimodal LLMs from HyperGAI ☆314 · Updated 9 months ago
- LLM2CLIP makes a SOTA pretrained CLIP model even more SOTA ☆495 · Updated last week
- SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models ☆208 · Updated 6 months ago
- Official implementation of "Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining" ☆551 · Updated 7 months ago
- A Unified Tokenizer for Visual Generation and Understanding ☆216 · Updated 3 weeks ago
- Evaluate Image/Video Generation like Humans - Fast, Explainable, Flexible ☆57 · Updated 3 months ago
- [CVPR 2025] Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models ☆172 · Updated 2 weeks ago
- [ECCV 2024] Official PyTorch implementation code for realizing the technical part of Mixture of All Intelligence (MoAI) to improve performance… ☆319 · Updated last year
- Explore the Multimodal "Aha Moment" on a 2B Model ☆538 · Updated 2 weeks ago
- Anole: An Open, Autoregressive and Native Multimodal Model for Interleaved Image-Text Generation ☆734 · Updated 7 months ago
- This is the official repository of our paper "What If We Recaption Billions of Web Images with LLaMA-3?" ☆128 · Updated 9 months ago
- CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts ☆145 · Updated 9 months ago
- Official repository of "GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing" ☆179 · Updated last week
- Official repository for the paper PLLaVA ☆644 · Updated 8 months ago
- Code for the Molmo Vision-Language Model ☆348 · Updated 3 months ago
- [ICLR 2025] LLaVA-HR: High-Resolution Large Language-Vision Assistant ☆234 · Updated 7 months ago
- Multimodal Models in Real World ☆455 · Updated last month
- [ICLR 2025] VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation ☆249 · Updated 2 months ago
- LLaVA-Interactive-Demo ☆367 · Updated 8 months ago